Jayanth Boddu
Clustering
An NGO wants to use its funding strategically to aid five countries in dire need of help. The objective is to use clustering algorithms to group countries based on socio-economic and health factors that reflect their overall development. The final deliverable is a suggestion of the 5 countries that need the aid the most.
Countries in dire need of aid are expected to show the following feature levels:
Feature | Expected Level |
---|---|
Child Mortality | High |
Life Expectancy | Low |
Fertility | High |
Health Spending | Low |
GDP per capita | Low |
Inflation Index | High |
Income per Person | Low |
Imports | High |
Exports | Low |
A new feature, 'Trade Deficit' (imports minus exports, per capita), has been derived.
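The derivation (repeated in the preprocessing code further below) is simply the difference of the converted per-capita values:
# trade deficit per capita = imports per capita - exports per capita
countries['trade_deficit'] = countries['imports'] - countries['exports']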
Univariate analysis revealed the following:
Feature | Highest | Lowest |
---|---|---|
Child Mortality | Haiti, Sierra Leone | Iceland |
Life Expectancy | Japan, Singapore | Haiti, Lesotho |
Total Fertility | Chad, Niger | Singapore, South Korea |
Health Spending | Switzerland, US | Madagascar, Eritrea |
GDP per capita | Luxembourg, Norway | Burundi, Liberia |
Inflation Index | Nigeria, Venezuela | Seychelles, Ireland |
Net Income per person | Qatar, Luxembourg | Liberia, Congo Dem. Rep. |
Imports per capita | Luxembourg, Singapore | Myanmar, Burundi |
Exports per capita | Luxembourg, Singapore | Myanmar, Burundi |
Trade Deficit per capita | Bahamas, Greece | Luxembourg, Qatar |
Outliers were treated as follows:
Feature | Upper Outliers | Lower Outliers |
---|---|---|
Child Mortality | Not Changed | Capped |
Life Expectancy | Capped | Not Changed |
Fertility | Not Changed | Capped |
Health Spending | Capped | Not Changed |
GDP per capita | Capped | Not Changed |
Inflation Index | Not Changed | Capped |
Income per Person | Capped | Not Changed |
Imports | Capped | Not Changed |
Exports | Capped | Not Changed |
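Capping was applied at the outlier thresholds identified during univariate analysis (the 1st and 99th percentiles). A minimal illustrative sketch, using lower outliers in `child_mort` as an example (the actual treatment appears in the outlier-analysis code below):
# cap values below the 1st percentile at the 1st percentile value
lower_cap = countries['child_mort'].quantile(0.01)
countries.loc[countries['child_mort'] < lower_cap, 'child_mort'] = lower_cap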
The columns `'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit'` were used for clustering.
The Hopkins statistic was calculated and showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
Characteristics of the clusters obtained (hierarchical clustering, complete linkage, correlation distance):
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import warnings
warnings.filterwarnings('ignore')
!pip install tabulate
!jt -f roboto -fs 12 -cellw 100%
from tabulate import tabulate
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure,show,output_notebook,reset_output
from bokeh.transform import factor_cmap
from bokeh.layouts import row
output_notebook()
# to table print a dataframe
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt='psql'))
countries = pd.read_csv('./Country-data.csv')
tab(countries.head())
`exports`, `imports` and `health` are given as a percentage of GDP per capita; converting them to actual per-capita values.
# conversion to actual values
countries['exports'] = 0.01 * countries['exports'] * countries['gdpp']
countries['imports'] = 0.01 * countries['imports'] * countries['gdpp']
countries['health'] = 0.01 * countries['health'] * countries['gdpp']
countries.info()
# taking a closer look to see if any country is duplicated
countries[countries.duplicated(subset=['country'])].index.values
# exports, health and imports were given as a percentage of gdpp and have been converted to actual values. Checking for anomalies where they exceed gdpp
condition1 = countries['health'] < countries['gdpp']
condition2 = countries['imports'] < countries['gdpp']
condition3 = countries['exports'] < countries['gdpp']
# countries which don't satisfy the above conditions
index = countries[~(condition1 & condition2 & condition3)].index.values
countries.loc[index]
# child mortality rate is per 1000 live births. Let's check if it's above 1000
countries[countries['child_mort'] > 1000].index.values
# summary statistics
tab(countries.describe())
# let's look at the upper quantiles to take a closer look at outliers
tab(countries.quantile(np.linspace(0.75, 1, 25), numeric_only=True).reset_index())
`exports`, `child_mort`, `imports`, `income`, `inflation`, `life_expec`, `total_fer` and `gdpp` have outliers.
# function to perform outlier analysis
def outlier_analysis(column):
    '''
    Prints a violin plot and box plot of the column provided.
    Also prints the major quantiles, the lower and upper outlier threshold values,
    and tables of countries which are outliers.
    Input : column name
    Output : lower outlier threshold condition, upper outlier threshold condition
    Side effects : violin plot, box plot, outlier tables
    '''
    plt.figure(figsize=[12, 6])
    plt.subplot(121)
    plt.title('Violin Plot of ' + column)
    sns.violinplot(countries[column])
    plt.subplot(122)
    plt.title('Box Plot of ' + column)
    sns.boxplot(countries[column])
    print('Quantiles\n')
    tab(countries[column].quantile([0.1, 0.25, 0.50, 0.75, 0.99]))
    lower_outlier_threshold = countries[column].quantile(0.01)
    upper_outlier_threshold = countries[column].quantile(0.99)
    print('\n\nLOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR', column, ': ', lower_outlier_threshold)
    l_condition = countries[column] < lower_outlier_threshold
    l_outliers = countries[l_condition][['country', column]].sort_values(by=column)
    if l_outliers.shape[0]:
        print('\n\nLower Outliers : ')
        tab(l_outliers)
    else:
        print('No lower outliers found in ' + column)
    print('\n\nUPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR', column, ': ', upper_outlier_threshold)
    u_condition = countries[column] > upper_outlier_threshold
    u_outliers = countries[u_condition][['country', column]].sort_values(by=column)
    if u_outliers.shape[0]:
        print('\n\nUpper Outliers : ')
        tab(u_outliers)
    print('\n\n')
    return l_condition, u_condition
# Countries with outliers in child mortality
column = 'child_mort'
l_condition, u_condition = outlier_analysis(column)
# Capping lower outliers in `child_mort` at the 1st percentile
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# LIFE EXPECTANCY
column = 'life_expec'
l_condition, u_condition = outlier_analysis(column)
# Capping upper outliers in life expectancy
countries.loc[u_condition, column] = countries[column].quantile(0.99)
column = 'total_fer'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in fertility at the 1st percentile threshold
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# health spending per capita
column = 'health'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in health spending
countries.loc[u_condition,column] = countries[column].quantile(0.99)
# GDP per capita
column = 'gdpp'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in gdpp
countries.loc[u_condition, column] = countries[column].quantile(.99)
column = 'inflation'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in inflation
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# income
column = 'income'
l_condition, u_condition = outlier_analysis(column)
column = 'imports'
l_condition, u_condition = outlier_analysis(column)
column = 'exports'
l_condition, u_condition = outlier_analysis(column)
# trade deficit
countries['trade_deficit'] = countries['imports'] - countries['exports']
column = 'trade_deficit'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers of trade_deficit
countries.loc[l_condition,column] = countries[column].quantile(0.01)
# Pair Plots of all variables
sns.pairplot(countries[['child_mort', 'exports', 'health', 'imports', 'income',
'inflation', 'life_expec', 'total_fer', 'gdpp']]);
## bivariate analysis bokeh function
def bivariate_analysis(x_var, y_var, dataframe=countries):
    # Bivariate scatter plot with tooltips : country, x, y
    dataframe = dataframe.copy()
    source = ColumnDataSource(dataframe)
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']
    # tooltips
    tooltips1 = [
        ("Country", "@country"),
        (x_var, '@' + x_var),
        (y_var, '@' + y_var)
    ]
    p = figure(plot_width=420, plot_height=400, title='Scatter Plot : ' + x_var + ' vs ' + y_var, tooltips=tooltips1)
    p.scatter(x=x_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color="#3498db")
    p.xaxis.axis_label = x_var
    p.yaxis.axis_label = y_var
    reset_output()
    output_notebook()
    show(p)
# CHILD MORTALITY vs LIFE EXPECTANCY
x='child_mort'
y = 'life_expec'
bivariate_analysis(x,y)
`child_mort` and `life_expec` are almost linearly (inversely) related.
# Countries with high child mortality and low life expectancy
life_expect_cond = countries['life_expec'] < 60
child_mort_cond = countries['child_mort'] > 100
print('Countries with low life expectancy and high Child Mortality Rate')
tab(countries[life_expect_cond & child_mort_cond][['country','child_mort','life_expec']].sort_values(by=['child_mort','life_expec'], ascending=[False,True])[:10])
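A quick numeric check of the near-linear inverse relationship noted above:
# correlation between child mortality and life expectancy
tab(countries[['child_mort', 'life_expec']].corr())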
# CHILD MORTALITY vs TOTAL FERTILITY
x='child_mort'
y = 'total_fer'
bivariate_analysis(x,y)
# countries with high fertility and child mortality rate
total_fer_cond = countries['total_fer'] > 6
child_mort_cond = countries['child_mort'] > 100
print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[total_fer_cond & child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# Fertility in countries with extremely high mortality rate
child_mort_cond = countries['child_mort'] > 150
print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# GDP per capita vs Health Spending
y='health'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with Low GDPP and health spending might need aid
health_cond = countries['health'] < 200
gdpp_cond = countries['gdpp'] < 2500
print('Countries with Low GDP and low Health Spending')
tab(countries[health_cond & gdpp_cond][['country','health','gdpp']].sort_values(by=['health','gdpp'], ascending=[True,True])[:10])
# IMPORTS vs GDPP
y='imports'
x= 'gdpp'
bivariate_analysis(x,y)
# Exports vs GDPP
y='exports'
x= 'gdpp'
bivariate_analysis(x,y)
y='trade_deficit'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with positive high trade deficit and low GDP per capita
tr_deficit_cond = countries['trade_deficit'] > 1000
gdpp_cond = countries['gdpp'] < 10000
print('Countries with high Trade Deficit and Low GDP')
tab(countries[tr_deficit_cond & gdpp_cond][['country','gdpp','trade_deficit']].sort_values(by=['trade_deficit','gdpp'], ascending=[True,False])[:10])
# Income vs Inflation
x = 'inflation'
y = 'income'
bivariate_analysis(x,y)
# Low Income - High inflation countries might require aid
inflation_cond = countries['inflation'] > 20
income_condition = countries['income'] < 10000
print('Countries with high inflation and low income')
tab(countries[inflation_cond & income_condition][['country','inflation','income']].sort_values(by=['inflation','income'], ascending=[False, True]))
x = 'gdpp'
y = 'inflation'
bivariate_analysis(x,y)
plt.figure(figsize=[12,12])
sns.heatmap(countries.corr(numeric_only=True), annot=True, cmap='YlGnBu', center=0)
Top correlations: `life_expec` and `child_mort`; `total_fer` and `child_mort`.
Although cluster analysis is not affected by multicollinearity, this plot shows the possible linear relationships between features, which helps in interpreting the results of the cluster analysis.
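One way to list the strongest pairwise correlations directly, as a minimal sketch (`numeric_only=True` simply excludes the non-numeric `country` column):
# rank feature pairs by absolute correlation, keeping each pair only once
corr = countries.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
tab(top_pairs.head(5))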
# hopkins test function
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan
def hopkins(X):
    d = X.shape[1]    # number of dimensions (columns)
    n = len(X)        # number of observations (rows)
    m = int(0.1 * n)  # sample size
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
    rand_X = sample(range(0, n, 1), m)
    ujd = []
    wjd = []
    for j in range(0, m):
        # nearest-neighbour distance from a point drawn uniformly over the data's range
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # nearest-neighbour distance from a randomly chosen data point
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
    return H
## Data used for Clustering
columns_for_clustering = ['child_mort', 'exports', 'health', 'imports', 'income',
'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit']
clustering_data = countries[columns_for_clustering].copy()
# Hopkin's test
n = 10
hopkins_statistic = []
for i in range(n):
    hopkins_statistic.append(hopkins(clustering_data))
print('Min Hopkin\'s Statistic in ',n,'iterations :', min(hopkins_statistic))
print('Max Hopkin\'s Statistic in ',n,'iterations :', max(hopkins_statistic))
print('Mean Hopkin\'s Statistic in ',n,'iterations :', np.mean(hopkins_statistic))
print('Std deviation of Hopkin\'s Statistic in ',n,'iterations :', np.std(hopkins_statistic))
Since the Hopkins statistic is greater than 0.8, the data shows a strong clustering tendency.
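For reference, the value computed by the function above is the standard Hopkins statistic, H = sum(ujd) / (sum(ujd) + sum(wjd)), where `ujd` holds nearest-neighbour distances measured from points sampled uniformly over the data's range and `wjd` holds nearest-neighbour distances measured from randomly chosen data points. Values around 0.5 indicate randomly distributed data, while values close to 1 indicate a strong clustering tendency.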
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
clustering_data[columns_for_clustering] = scaler.fit_transform(clustering_data[columns_for_clustering])
tab(clustering_data.describe())
# Plotting Elbow curve of Sum of Squared distances of points in each cluster from the centroid of the nearest cluster.
ssd = []
range_n_clusters = np.arange(2,9)
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(clustering_data)
    ssd.append(kmeans.inertia_)
plt.plot(range_n_clusters,ssd)
plt.title('Elbow Curve');
plt.xlabel('No of clusters');
plt.ylabel('SSD');
from sklearn.metrics import silhouette_score
no_of_clusters = np.arange(2,10)
score = []
for n_cluster in no_of_clusters:
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    kmeans = kmeans.fit(clustering_data)
    labels = kmeans.labels_
    score.append(silhouette_score(clustering_data, labels))
plt.title('Silhouette Analysis Plot')
plt.xlabel('No of Clusters')
plt.ylabel('Silhouette Score')
plt.plot(no_of_clusters, score);
print(score)
# k-means clustering with k = 4 (chosen from the elbow curve and silhouette analysis above)
n_cluster = 4
kmeans = KMeans(n_clusters=n_cluster, init='k-means++', random_state = 100)
kmeans = kmeans.fit(clustering_data)
labels = kmeans.labels_
countries['k_means_cluster_id'] = labels
# Countries in each Cluster - k means
for cluster_no in range(n_cluster):
    condition = countries['k_means_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Agglomerative Single Linkage
mergings = linkage(clustering_data,method='single',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Single Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Complete Linkage
mergings = linkage(clustering_data,method='complete',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Complete Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Complete Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_complete_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - Complete Linkage
for cluster_no in range(n_clusters):
    condition = countries['hac_complete_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Average Linkage
mergings = linkage(clustering_data,method='average',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Average Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Average Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_average_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - average Linkage
for cluster_no in range(n_clusters):
    condition = countries['hac_average_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
## HAC Clustering : Dissimilarity Measure : Correlation
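For reference, scipy's `'correlation'` metric measures the dissimilarity between two countries' (scaled) feature vectors u and v as d(u, v) = 1 - corr(u, v), so countries whose feature profiles move together are considered close even if their magnitudes differ. A minimal sketch of the underlying pairwise distance matrix (illustrative only; the clustering itself needs only the `linkage` call below):
from scipy.spatial.distance import pdist, squareform
# pairwise correlation distances between countries (rows of the scaled clustering data)
correlation_distances = squareform(pdist(clustering_data[columns_for_clustering], metric='correlation'))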
hac_correlation_mergings = linkage(clustering_data,method='complete', metric='correlation')
plt.figure(figsize=[12,12])
plt.title('Hierarchical Clustering : Complete Linkage, Correlation Measure')
dendrogram(hac_correlation_mergings);
n_clusters = 6
labels = cut_tree(hac_correlation_mergings, n_clusters=n_clusters)
countries['hac_correlation_cluster_id'] = labels
# HAC clustering : Correlation measure : complete distance : Countries in each cluster
for cluster_no in range(n_clusters):
    condition = countries['hac_correlation_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Performing k-means using the results of hierarchical clustering:
# 1. Number of clusters taken from hierarchical clustering
# 2. Centroids obtained from hierarchical clustering used as the initialization points
clustering_data['k_means_cluster_id'] = countries['k_means_cluster_id']
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
columns = columns_for_clustering.copy()
columns.extend(['hac_correlation_cluster_id'])
centroids = clustering_data[columns].groupby(['hac_correlation_cluster_id']).mean()
n_clusters = 6
mixed_kmeans = KMeans(n_clusters=n_clusters, init=centroids.values, n_init=1, random_state=100)
results = mixed_kmeans.fit(clustering_data[columns_for_clustering])
countries['mixed_cluster_id'] = results.labels_
# Mixed clustering : Euclidean measure : k-means : Countries in each cluster
for cluster_no in range(n_clusters):
    condition = countries['mixed_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# silhouette scores of all the methods
print('Mixed Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['mixed_cluster_id']))
print('K-means Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['k_means_cluster_id']))
print('Hierarchical Correlation Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['hac_correlation_cluster_id']))
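For easier comparison, the same three scores can be collected into one small table (a minimal sketch):
# silhouette scores of the three approaches side by side
silhouette_comparison = pd.DataFrame({
    'method': ['Mixed Clustering', 'K-means Clustering', 'Hierarchical Correlation Clustering'],
    'silhouette_score': [
        silhouette_score(clustering_data[columns_for_clustering], countries['mixed_cluster_id']),
        silhouette_score(clustering_data[columns_for_clustering], countries['k_means_cluster_id']),
        silhouette_score(clustering_data[columns_for_clustering], countries['hac_correlation_cluster_id']),
    ],
})
tab(silhouette_comparison)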
# Cluster Profiling - Plots using Bokeh
cluster_id_column = 'hierarchical-c-link-cluster-id'
title="Hierarchical Clustering"
def cluster_analysis_plot(cluster_id_column, title, x_var='income', y_var='child_mort', z_var='gdpp', dataframe=countries):
    # Three scatter plots per clustering result
    # works up to 6 clusters (limited by the palette)
    dataframe = dataframe.copy()
    dataframe[cluster_id_column] = dataframe[cluster_id_column].astype('str')
    source = ColumnDataSource(dataframe)
    cluster_ids = sorted(dataframe[cluster_id_column].unique())
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']
    mapper = factor_cmap(cluster_id_column, palette=pallete[:len(cluster_ids)], factors=cluster_ids)
    # plot 1 : z_var vs x_var
    tooltips1 = [
        ("Country", "@country"),
        (z_var, '@' + z_var),
        (x_var, '@' + x_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    p = figure(plot_width=420, plot_height=400, title=title + ' : ' + z_var + ' vs ' + x_var, tooltips=tooltips1, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        p.scatter(x=z_var, y=x_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], muted_alpha=0.1, legend_label=index)
    p.xaxis.axis_label = z_var
    p.yaxis.axis_label = x_var
    p.legend.click_policy = "mute"
    # ----------------------------
    # plot 2 : z_var vs y_var
    tooltips2 = [
        ("Country", "@country"),
        (z_var, '@' + z_var),
        (y_var, '@' + y_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    q = figure(plot_width=420, plot_height=400, title=title + ' : ' + z_var + ' vs ' + y_var, tooltips=tooltips2, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        q.scatter(x=z_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], muted_alpha=0.1, legend_label=index)
    q.xaxis.axis_label = z_var
    q.yaxis.axis_label = y_var
    q.legend.click_policy = "mute"
    # ----------------------------
    # plot 3 : x_var vs y_var
    tooltips3 = [
        ("Country", "@country"),
        (x_var, '@' + x_var),
        (y_var, '@' + y_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    r = figure(plot_width=420, plot_height=400, title=title + ' : ' + x_var + ' vs ' + y_var, tooltips=tooltips3, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        r.scatter(x=x_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], legend_label=index, muted_alpha=0.1)
    r.xaxis.axis_label = x_var
    r.yaxis.axis_label = y_var
    r.legend.click_policy = "mute"
    show(row(p, q, r))
# Cluster Profiling for k-means with 4 clusters
# hover for country names and x, y values , cluster no
# Click on legend to selectively view clusters
cluster_analysis_plot('k_means_cluster_id','k-means clusters')
# Comparing k-means Clusters using mean values of features
clustering_data[['child_mort','income','gdpp','k_means_cluster_id']].groupby('k_means_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for K-means results');
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for K-means')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='k_means_cluster_id', data=clustering_data)
plt.figure(figsize=[8,8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['k_means_cluster_id']], 'k_means_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]);
plt.title('Parallel Coordinate Plot for K-means')
plt.xticks(rotation=45);
From the above plots, each cluster's features can be characterized at the following levels.
Characteristics of each cluster:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | High | High to Very High | Low |
1 | Low | Low | High to Very High |
2 | Low to Moderate | Low to Moderate | Low to Moderate |
3 | Very High | Very High | Low |
From these characteristics, Cluster 1 is our area of interest. Let's look at the countries in Cluster 1.
# Countries in cluster with area of interest
condition = countries['k_means_cluster_id'] == 1
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
# HAC : Complete linkage, Correlation based distance : cluster analysis plot
# hover for country names and x, y values , cluster no
# Click on legend to selectively view clusters
cluster_analysis_plot('hac_correlation_cluster_id','HAC CORRELATION CLUSTERS')
# Comparing Hierarchical Clusters using mean values of features
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
clustering_data[['child_mort','income','gdpp','hac_correlation_cluster_id']].groupby('hac_correlation_cluster_id').mean().plot(kind='barh');
plt.title('Comparison of Cluster Means for Hierarchical Clustering');
# box plots
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for Hierarchical Clustering')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='hac_correlation_cluster_id', data=clustering_data)
# Parallel Coordinates plot for hierarchical clustering with correlation measure and complete linkage
plt.figure(figsize=[8, 8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['hac_correlation_cluster_id']], 'hac_correlation_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']);
plt.title('Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage')
plt.xticks(rotation=45);
From the above plots, each cluster's features can be characterized at the following levels.
Characteristics of each cluster:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
From these characteristics, Cluster 0 is our area of interest. Let's look at the countries in Cluster 0.
# Countries in cluster with area of interest
condition = countries['hac_correlation_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
cluster_analysis_plot('mixed_cluster_id','MIXED K-MEANS CLUSTERS')
# Comparing MIXED k-means Clusters using mean values of features
clustering_data['mixed_cluster_id'] = countries['mixed_cluster_id']
clustering_data[['child_mort','income','gdpp','mixed_cluster_id']].groupby('mixed_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of cluster means for Mixed Clustering');
# box plots
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for Mixed Clustering')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='mixed_cluster_id', data=clustering_data)
# parallel coordinate plot
plt.figure(figsize=[8,8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['mixed_cluster_id']], 'mixed_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']);
plt.title('Parallel Coordinate Plot for Mixed Clustering');
plt.xticks(rotation=45);
# Countries in cluster with area of interest
condition = countries['mixed_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
The columns `'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit'` were used for clustering.
The Hopkins statistic was calculated and showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
Because it produced more interpretable results, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the final results.
Characteristics of the clusters obtained:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
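For reference, the final shortlist follows directly from the chosen clustering: the five countries from the very-low-GDP, very-low-income, very-high-child-mortality cluster (Cluster 0), ranked by highest child mortality and then by lowest income and GDP per capita (repeating the selection step shown earlier):
# final recommendation : top 5 countries in the cluster of interest
condition = countries['hac_correlation_cluster_id'] == 0
tab(countries.loc[condition, ['country', 'child_mort', 'income', 'gdpp']]
        .sort_values(by=['child_mort', 'income', 'gdpp'], ascending=[False, True, True])[:5])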