Jayanth Boddu
Clustering
An NGO wants to use its funding strategically to aid five countries in dire need of help. The analyst's objective is to use clustering algorithms to group countries by socio-economic and health factors in order to judge their overall development. The final deliverable is a list of the 5 countries that need the aid the most.
| Feature | High / Low | 
|---|---|
| Child Mortality | High | 
| Life Expectancy | Low | 
| Fertility | High | 
| Health Spending | Low | 
| GDP per capita | Low | 
| Inflation Index | High | 
| Income per Person | Low | 
| Imports | High | 
| Exports | Low | 
A new feature 'Trade Deficit' has been derived.
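As a minimal sketch of that derivation, assuming (as the notebook's code does) that exports and imports are reported as percentages of GDP per capita: convert both to per-capita values, then take imports minus exports. The two-row `toy` frame and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: exports/imports are percentages of GDP per capita
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'exports': [10.0, 50.0],
    'imports': [40.0, 30.0],
    'gdpp': [1000.0, 20000.0],
})

# Convert percentages to per-capita values
toy['exports'] = 0.01 * toy['exports'] * toy['gdpp']
toy['imports'] = 0.01 * toy['imports'] * toy['gdpp']

# Positive value: the country imports more than it exports
toy['trade_deficit'] = toy['imports'] - toy['exports']
print(toy[['country', 'trade_deficit']])
```

A positive trade deficit means imports exceed exports; a negative value is a trade surplus.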
Univariate analysis revealed the following information
| Feature | Highest | Lowest | 
|---|---|---|
| Child Mortality | Haiti, Sierra Leone | Iceland | 
| Life Expectancy | Japan, Singapore | Haiti, Lesotho | 
| Total fertility | Chad, Niger | Singapore, South Korea | 
| Health Spending | Switzerland, US | Madagascar, Eritrea | 
| GDP per capita | Luxembourg, Norway | Burundi, Liberia | 
| Inflation Index | Nigeria, Venezuela | Seychelles, Ireland | 
| Net Income per person | Qatar, Luxembourg | Liberia, Congo Dem. Rep. | 
| Imports per capita | Luxembourg, Singapore | Myanmar, Burundi | 
| Exports per capita | Luxembourg, Singapore | Myanmar, Burundi | 
| Trade Deficit per capita | Bahamas, Greece | Luxembourg, Qatar | 
| Feature | Upper Outliers | Lower Outliers | 
|---|---|---|
| Child Mortality | Not Changed | Capped | 
| Life Expectancy | Capped | Not Changed | 
| Fertility | Not Changed | Capped | 
| Health Spending | Capped | Not Changed | 
| GDP per capita | Capped | Not Changed | 
| Inflation Index | Not Changed | Capped | 
| Income per Person | Capped | Not Changed | 
| Imports | Capped | Not Changed | 
| Exports | Capped | Not Changed | 
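The capping summarised in the table above can be sketched as follows, assuming the 1st and 99th percentiles as thresholds (the values used by the notebook's `outlier_analysis` function). `cap_outliers` is a hypothetical helper, not part of the original notebook, and the series is made up.

```python
import numpy as np
import pandas as pd

def cap_outliers(series, lower=None, upper=None):
    """Cap values outside the given quantiles (hypothetical helper).

    lower / upper are quantile levels such as 0.01 or 0.99;
    passing None leaves that side unchanged.
    """
    lo = series.quantile(lower) if lower is not None else None
    hi = series.quantile(upper) if upper is not None else None
    return series.clip(lower=lo, upper=hi)

# Made-up values with one extreme high value
values = pd.Series(np.arange(1, 101, dtype=float))
values.iloc[-1] = 10_000.0

capped = cap_outliers(values, upper=0.99)   # cap the upper tail only
print(values.max(), capped.max())
```

Capping keeps every country in the data set, which matters here because the countries with extreme values may be the ones that need aid.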
Columns : 'child_mort', 'exports', 'health', 'imports', 'income','inflation', 'life_expec', 'total_fer', 'gdpp','trade_deficit' were used for clustering.
The Hopkins statistic was calculated; it showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
| Cluster | GDP | Income | Child Mortality | 
|---|---|---|---|
| 0 | Very Low | Very Low | Very High | 
| 1 | Low | Low | Low | 
| 2 | Moderate | Moderate | Moderate | 
| 3 | High | High | Low | 
| 4 | Very High | Very High | Low | 
| 5 | Very Low | Very Low | Low | 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_style('whitegrid') 
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import warnings
warnings.filterwarnings('ignore')
!pip install tabulate
!jt -f roboto -fs 12 -cellw 100%
from tabulate import tabulate
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure,show,output_notebook,reset_output
from bokeh.transform import factor_cmap
from bokeh.layouts import row
output_notebook()
# print a dataframe (or series) as a formatted table 
def tab(ser) : 
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt='psql'))
countries = pd.read_csv('./Country-data.csv')
tab(countries.head())
exports, imports and health are given as a percentage of GDP per capita.
# conversion to actual values
countries['exports'] = 0.01 * countries['exports'] * countries['gdpp']
countries['imports'] = 0.01 * countries['imports'] * countries['gdpp']
countries['health'] = 0.01 * countries['health'] * countries['gdpp']
# info() prints its summary directly and returns None, so it is not passed to tab()
countries.info()
# taking a closer look to see if any country is duplicated 
countries[countries.duplicated(subset=['country'])].index.values
# exports, health, imports are a percentage of GDPP . Checking if there are any anomalies
condition1 = countries['health'] < countries['gdpp']
condition2 = countries['imports'] < countries['gdpp']
condition3 = countries['exports'] < countries['gdpp']
# countries which don't satisfy the above conditions
index = countries[~(condition1 & condition2 & condition3)].index.values
countries.loc[index]
# child mortality rate is calculated per 1000 live births. Let's check if it's above 1000 
countries[countries['child_mort'] > 1000].index.values
# summary statistics
tab(countries.describe())
# let's look at the upper quantiles to take a closer look at outliers 
# 'country' is dropped because quantiles are only defined for numeric columns
tab(countries.drop(columns='country').quantile(np.linspace(0.75,1,25)).reset_index())
exports, child_mort, imports, income, inflation, life_expec, total_fer and gdpp have outliers.
# function to perform outlier analysis 
def outlier_analysis(column) : 
    '''
    This function prints a violin plot and box plot of the column provided.
    It also prints the five major quantiles, lower outlier threshold value, upper outlier threshold value, and tables of countries which are outliers 
    Output : lower outlier threshold condition, upper outlier threshold condition
    Input : column name 
    Side effects : Violin plot, box plot, outlier tables
    '''
    plt.figure(figsize=[12,6])
    plt.subplot(121)
    plt.title('Violin Plot of '+column)
    sns.violinplot(x=countries[column])
    plt.subplot(122)
    plt.title('Box Plot of '+column)
    sns.boxplot(x=countries[column])
    print('Quantiles\n')
    # tab() prints the table itself and returns None, so it is not wrapped in print()
    tab(countries[column].quantile([.1,0.25,.50,0.75,0.99]))
    lower_outlier_threshold = countries[column].quantile(0.01)
    upper_outlier_threshold = countries[column].quantile(0.99)
    print('\n\nLOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR' ,column,': ',lower_outlier_threshold)
    l_condition = countries[column] < lower_outlier_threshold 
    l_outliers = countries[l_condition][['country',column]].sort_values(by=column)
   
    if l_outliers.shape[0] : 
        print('\n\nLower Outliers : ')
        tab(l_outliers)
    else : 
        print('No lower outliers found in ' + column)
    print('\n\nUPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR' ,column,': ',upper_outlier_threshold)
    u_condition = countries[column] > upper_outlier_threshold 
    u_outliers = countries[u_condition][['country',column]].sort_values(by=column)
    if u_outliers.shape[0] : 
        print('\n\nUpper Outliers : ')
        tab(u_outliers)
        print('\n\n')
    return l_condition, u_condition
# Countries with outliers in child mortality 
column = 'child_mort'
l_condition, u_condition = outlier_analysis(column)
# Capping lower outliers in `child_mort` at the 1st percentile 
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# LIFE EXPECTANCY 
column = 'life_expec'
l_condition, u_condition = outlier_analysis(column)
# Capping upper outliers in life expectancy 
countries.loc[u_condition, column] = countries[column].quantile(0.99)
column = 'total_fer'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in fertility at the 1st percentile (the threshold used by outlier_analysis)
countries.loc[l_condition,column] = countries[column].quantile(0.01)
# health spending per capita 
column = 'health'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in health spending 
countries.loc[u_condition,column] = countries[column].quantile(0.99)
# GDP per capita 
column = 'gdpp'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in gdpp
countries.loc[u_condition, column] = countries[column].quantile(.99)
column = 'inflation'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in inflation
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# income 
column = 'income'
l_condition, u_condition = outlier_analysis(column)
column = 'imports'
l_condition, u_condition = outlier_analysis(column)
column = 'exports'
l_condition, u_condition = outlier_analysis(column)
# trade deficit
countries['trade_deficit'] = countries['imports'] - countries['exports']
column = 'trade_deficit'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers of trade_deficit
countries.loc[l_condition,column] = countries[column].quantile(0.01)
# Pair Plots of all variables 
sns.pairplot(countries[['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp']]);
## bivariate analysis bokeh function
def bivariate_analysis(x_var,y_var,dataframe=countries) : 
    # Bivariate Plots with tooltips : country,x,y 
    dataframe = dataframe.copy()
    source =ColumnDataSource(dataframe)
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e",'rgba(42, 157, 143, 1)']
    # tooltips
    tooltips1 = [
    ("Country", "@country"),
    (x_var,'@'+x_var),
    (y_var,'@'+y_var)
    ]
    p = figure(plot_width=420, plot_height=400,title='Scatter Plot : '+ x_var+' vs '+y_var, tooltips=tooltips1)
    p.scatter(x=x_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = "#3498db")
    p.xaxis.axis_label = x_var
    p.yaxis.axis_label = y_var
    reset_output()
    output_notebook()
    show(p)
# CHILD MORTALITY vs LIFE EXPECTANCY 
x='child_mort'
y = 'life_expec'
bivariate_analysis(x,y)
child_mort and life_expec are almost linearly related.
# Countries with high child mortality and low life expectancy 
life_expect_cond = countries['life_expec'] < 60 
child_mort_cond = countries['child_mort'] > 100
print('Countries with low life expectancy and high Child Mortality Rate')
tab(countries[life_expect_cond & child_mort_cond][['country','child_mort','life_expec']].sort_values(by=['child_mort','life_expec'], ascending=[False,True])[:10])
# CHILD MORTALITY vs TOTAL FERTILITY 
x='child_mort'
y = 'total_fer'
bivariate_analysis(x,y)
# countries with high fertility and child mortality rate 
total_fer_cond = countries['total_fer'] > 6 
child_mort_cond = countries['child_mort'] > 100
print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[total_fer_cond & child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# Fertility in countries with extremely high mortality rate
child_mort_cond = countries['child_mort'] > 150
print('Total Fertility in countries with extremely high Child Mortality Rate')
tab(countries[child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# GDP per capita vs Health Spending 
y='health'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with Low GDPP and health spending might need aid 
health_cond = countries['health'] < 200
gdpp_cond = countries['gdpp'] < 2500
print('Countries with Low GDP and low Health Spending')
tab(countries[health_cond & gdpp_cond][['country','health','gdpp']].sort_values(by=['health','gdpp'], ascending=[True,True])[:10])
# IMPORTS vs GDPP
y='imports'
x= 'gdpp'
bivariate_analysis(x,y)
# Exports vs GDPP
y='exports'
x= 'gdpp'
bivariate_analysis(x,y)
y='trade_deficit'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with positive high trade deficit and low GDP per capita
tr_deficit_cond = countries['trade_deficit'] > 1000
gdpp_cond = countries['gdpp'] < 10000
print('Countries with high Trade Deficit and Low GDP')
tab(countries[tr_deficit_cond & gdpp_cond][['country','gdpp','trade_deficit']].sort_values(by=['trade_deficit','gdpp'], ascending=[True,False])[:10])
# Income vs Inflation
x = 'inflation'
y = 'income'
bivariate_analysis(x,y)
# Low Income - High inflation countries might require aid 
inflation_cond = countries['inflation'] > 20
income_condition = countries['income'] < 10000
print('Countries with high inflation and low income')
tab(countries[inflation_cond & income_condition][['country','inflation','income']].sort_values(by=['inflation','income'], ascending=[False, True]))
x = 'gdpp'
y = 'inflation'
bivariate_analysis(x,y)
plt.figure(figsize=[12,12])
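# 'country' is dropped because correlations are only defined for numeric columns
sns.heatmap(countries.drop(columns='country').corr(),annot=True,cmap='YlGnBu', center=0)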
Top Correlations
- life_expec and child_mort
- total_fer and child_mort

Although clustering analysis is not affected by multicollinearity, this plot shows us the possible linear relationships between different features to help with results obtained from cluster analysis.
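The strongest pairs can also be read off a correlation matrix programmatically rather than by eye. The following is a minimal sketch on made-up data; `top_correlations` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Order the pairs by absolute correlation, strongest first
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Made-up data: b is almost -a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': -a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
print(top_correlations(toy, n=1))
```

Sorting by absolute value keeps strong negative relationships, such as the one between life expectancy and child mortality, at the top of the list.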
# hopkins test function
from sklearn.neighbors import NearestNeighbors
from random import sample 
from numpy.random import uniform
from math import isnan 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H
## Data used for Clustering 
columns_for_clustering = ['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit']
clustering_data = countries[columns_for_clustering].copy()
# Hopkins test 
n = 10
hopkins_statistic = []
for i in range(n) : 
    hopkins_statistic.append(hopkins(clustering_data))
print('Min Hopkins statistic in ',n,'iterations :', min(hopkins_statistic))
print('Max Hopkins statistic in ',n,'iterations :', max(hopkins_statistic))
print('Mean Hopkins statistic in ',n,'iterations :', np.mean(hopkins_statistic))
print('Std deviation of Hopkins statistic in ',n,'iterations :', np.std(hopkins_statistic))
Since the mean Hopkins statistic is greater than 80%, the data shows a good clustering tendency.
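For reference, the quantity returned by the `hopkins` function above has the following form. The symbols $u_j$, $w_j$ and $m$ are notation introduced here, not taken from the notebook: $u_j$ is the nearest-neighbour distance in the data for the $j$-th uniformly generated point, $w_j$ is the nearest-neighbour distance for the $j$-th point sampled from the data itself, and $m$ is the sample size (10% of the rows in the code).

$$
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
$$

Under this definition, uniformly scattered data gives $u_j \approx w_j$ and so $H \approx 0.5$, while clustered data gives $w_j \ll u_j$ and pushes $H$ towards 1.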
from sklearn.preprocessing import StandardScaler 
scaler = StandardScaler()
clustering_data[columns_for_clustering] = scaler.fit_transform(clustering_data[columns_for_clustering])
tab(clustering_data.describe())
# Plotting the elbow curve: sum of squared distances (SSD) of each point from its nearest cluster centroid.
ssd = []
range_n_clusters = np.arange(2,9)
for num_clusters in range_n_clusters : 
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(clustering_data)
    ssd.append(kmeans.inertia_)
plt.plot(range_n_clusters,ssd)
plt.title('Elbow Curve');
plt.xlabel('No of clusters');
plt.ylabel('SSD');
from sklearn.metrics import silhouette_score 
no_of_clusters = np.arange(2,10)
score = [] 
for n_cluster in no_of_clusters : 
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    kmeans = kmeans.fit(clustering_data)
    labels = kmeans.labels_
    score.append(silhouette_score(clustering_data,labels))
    
plt.title('Silhouette Analysis Plot')
plt.xlabel('No of Clusters')
plt.ylabel('Silhouette Score') 
plt.plot(no_of_clusters, score);
print(score)
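One common way to read a silhouette plot is to take the number of clusters with the highest score as a candidate, then weigh it against the elbow curve and interpretability. The sketch below illustrates only the first step, on synthetic blobs rather than the country data; the blob centres and the range of `k` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (not the country data)
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate k: the one with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the maximum is often less clear-cut, which is why the score is treated here as a candidate rather than a final answer.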
# k - means clustering algo with k = 4
n_cluster = 4
kmeans = KMeans(n_clusters=n_cluster, init='k-means++', random_state = 100)
kmeans = kmeans.fit(clustering_data)
labels = kmeans.labels_
countries['k_means_cluster_id'] = labels
# Countries in each Cluster - k means 
for cluster_no in range(n_cluster) : 
    condition = countries['k_means_cluster_id'] == cluster_no 
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')
# Agglomerative Single Linkage 
mergings = linkage(clustering_data,method='single',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Single Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Complete Linkage 
mergings = linkage(clustering_data,method='complete',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Complete Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Complete Linkage, cutting the tree for 6 clusters 
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_complete_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - Complete Linkage
for cluster_no in range(n_clusters) : 
    condition = countries['hac_complete_cluster_id'] == cluster_no 
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')
# Average Linkage 
mergings = linkage(clustering_data,method='average',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Average Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Average Linkage, cutting the tree for 6 clusters 
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_average_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - average Linkage
for cluster_no in range(n_clusters) : 
    condition = countries['hac_average_cluster_id'] == cluster_no 
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')
## HAC Clustering : Dissimilarity Measure : Correlation
hac_correlation_mergings = linkage(clustering_data,method='complete', metric='correlation')
plt.figure(figsize=[12,12])
plt.title('Hierarchical Clustering : Complete Linkage, Correlation Measure')
dendrogram(hac_correlation_mergings);
n_clusters = 6
labels = cut_tree(hac_correlation_mergings, n_clusters=n_clusters)
countries['hac_correlation_cluster_id'] = labels
# HAC clustering : Correlation measure : complete distance : Countries in each cluster 
for cluster_no in range(n_clusters) : 
    condition = countries['hac_correlation_cluster_id'] == cluster_no 
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')
# Performing k-means using results of Hierarchical clustering
# 1. No of clusters of Hierarchical Clustering
# 2. Centroids obtained from Hierarchical Clustering as the initialization points. 
clustering_data['k_means_cluster_id'] = countries['k_means_cluster_id']
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
columns = columns_for_clustering.copy()
columns.extend(['hac_correlation_cluster_id'])
centroids = clustering_data[columns].groupby(['hac_correlation_cluster_id']).mean()
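n_clusters = 6
# explicit initial centroids are supplied, so a single initialization (n_init=1) is sufficient
mixed_kmeans = KMeans(n_clusters=n_clusters, init=centroids.values, n_init=1, random_state=100)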
results = mixed_kmeans.fit(clustering_data[columns_for_clustering])
countries['mixed_cluster_id'] = results.labels_
# Mixed clustering : Euclidean measure : k-means : Countries in each cluster 
for cluster_no in range(n_clusters) : 
    condition = countries['mixed_cluster_id'] == cluster_no 
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')
# silhouette scores of all the methods
print('Mixed Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['mixed_cluster_id']))
print('K-means Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['k_means_cluster_id']))
print('Hierarchical Correlation Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['hac_correlation_cluster_id']))
# Cluster Profiling -  Plots using Bokeh
cluster_id_column = 'hierarchical-c-link-cluster-id'
title="Hierarchical Clustering"
def cluster_analysis_plot(cluster_id_column,title,x_var='income',y_var='child_mort',z_var='gdpp',dataframe=countries) : 
    # Plots 
    # works upto 6 clusters 
    dataframe = dataframe.copy()
    dataframe[cluster_id_column] = dataframe[cluster_id_column].astype('str')
    source =ColumnDataSource(dataframe)
    cluster_ids = sorted(dataframe[cluster_id_column].unique())
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e",'rgba(42, 157, 143, 1)']
    mapper = factor_cmap(cluster_id_column,palette=pallete[:len(cluster_ids)], factors = cluster_ids)
    # plot 1
    tooltips1 = [
    ("Country", "@country"),
    (z_var,'@'+z_var),
    (x_var,'@'+x_var), 
    ('Cluster', '@'+cluster_id_column)
    ]
    p = figure(plot_width=420, plot_height=400,title=title+ ' : '+ z_var+' vs '+x_var, tooltips=tooltips1, toolbar_location=None)
    for num,index in enumerate(cluster_ids) : 
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        p.scatter(x=z_var,y=x_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , muted_alpha=0.1, legend_label=index)
    p.xaxis.axis_label = z_var
    p.yaxis.axis_label = x_var
    p.legend.click_policy="mute"
    # ----------------------------
    #Plot 2
    tooltips2 = [
    ("Country", "@country"),
    (z_var,'@'+z_var),
    (y_var,'@'+y_var), 
    ('Cluster', '@'+cluster_id_column)
    ]
    
    q = figure(plot_width=420, plot_height=400,title=title+ ' : '+ z_var+' vs '+y_var, tooltips=tooltips2, toolbar_location=None)
    for num,index in enumerate(cluster_ids) : 
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        q.scatter(x=z_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , muted_alpha=0.1,  legend_label=index)
    q.xaxis.axis_label = z_var
    q.yaxis.axis_label = y_var
    q.legend.click_policy="mute"
    # ----------------------------
    #Plot 3
    tooltips3 = [
    ("Country", "@country"),
    (x_var,'@'+x_var),
    (y_var,'@'+y_var), 
    ('Cluster', '@'+cluster_id_column)
    ]
    r = figure(plot_width=420, plot_height=400,title=title+ ' : '+ x_var+' vs '+y_var, tooltips=tooltips3, toolbar_location=None)
    for num,index in enumerate(cluster_ids) : 
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        r.scatter(x=x_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , legend_label=index, muted_alpha=0.1 )
    r.xaxis.axis_label = x_var
    r.yaxis.axis_label = y_var
    r.legend.click_policy="mute"
    show(row(p,q,r))
# Cluster Profiling for k-means with 4 clusters 
# hover for country names and x, y values , cluster no 
# Click on legend to selectively view clusters
cluster_analysis_plot('k_means_cluster_id','k-means clusters')
# Comparing k-means Clusters using mean values of features 
clustering_data[['child_mort','income','gdpp','k_means_cluster_id']].groupby('k_means_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for K-means results');
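plot_columns = ['child_mort','income','gdpp']
plt.figure(figsize=[15,5])
plt.suptitle('Comparison of Clusters Characteristics for K-means');
for idx,column in enumerate(plot_columns) : 
    # subplot takes integers (rows, columns, index); string specs such as '131' are deprecated in recent matplotlib
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='k_means_cluster_id',data=clustering_data)
plt.figure(figsize=[8,8])
# plot only the clustering features plus the k-means label, so other cluster-id columns are not drawn as axes
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['k_means_cluster_id']], 'k_means_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]);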
plt.title('Parallel Coordinate Plot for K-means')
plt.xticks(rotation=45);
From the above plots, we could characterize features of each cluster as within the following levels.
Characteristics of each cluster
| Cluster | GDP | Income | Child Mortality | 
|---|---|---|---|
| 0 | High | High to Very High | Low | 
| 1 | Low | Low | High to Very High | 
| 2 | Low to Moderate | Low to Moderate | Low to Moderate | 
| 3 | Very High | Very High | Low | 
From these characteristics, we see that Cluster 1 is our area of interest. Let's look at the countries in Cluster 1.
# Countries in cluster with area of interest 
condition = countries['k_means_cluster_id'] == 1
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
# HAC : Complete linkage, Correlation based distance :  cluster analysis plot
# hover for country names and x, y values , cluster no 
# Click on legend to selectively view clusters 
cluster_analysis_plot('hac_correlation_cluster_id','HAC CORRELATION CLUSTERS')
# Comparing Hierarchical Clusters using mean values of features 
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
clustering_data[['child_mort','income','gdpp','hac_correlation_cluster_id']].groupby('hac_correlation_cluster_id').mean().plot(kind='barh');
plt.title('Comparison of Cluster Means for Hierarchical Clustering');
# box plots
plot_columns = ['child_mort','income','gdpp']
plt.figure(figsize=[15,5])
plt.suptitle('Comparison of Clusters Characteristics for Hierarchical Clustering');
for idx,column in enumerate(plot_columns) : 
    # subplot takes integers (rows, columns, index); string specs such as '131' are deprecated in recent matplotlib
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='hac_correlation_cluster_id',data=clustering_data)
# Parallel Coordinates plot for Hierarchical clustering with correlation measure and complete linkage
plt.figure(figsize=[8,8])
# plot only the clustering features plus this label; the palette keeps only colour strings matplotlib can parse
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['hac_correlation_cluster_id']], 'hac_correlation_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e"]);
plt.title('Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage')
plt.xticks(rotation=45);
From the above plots, we could characterize features of each cluster as within the following levels.
Characteristics of each cluster
| Cluster | GDP | Income | Child Mortality | 
|---|---|---|---|
| 0 | Very Low | Very Low | Very High | 
| 1 | Low | Low | Low | 
| 2 | Moderate | Moderate | Moderate | 
| 3 | High | High | Low | 
| 4 | Very High | Very High | Low | 
| 5 | Very Low | Very Low | Low | 
From these characteristics, we see that Cluster 0 is our area of interest. Let's look at the countries in Cluster 0.
# Countries in cluster with area of interest 
condition = countries['hac_correlation_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
cluster_analysis_plot('mixed_cluster_id','MIXED K-MEANS CLUSTERS')
# Comparing MIXED k-means Clusters using mean values of features 
clustering_data['mixed_cluster_id'] = countries['mixed_cluster_id']
clustering_data[['child_mort','income','gdpp','mixed_cluster_id']].groupby('mixed_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of cluster means for Mixed Clustering');
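# box plots
plot_columns = ['child_mort','income','gdpp']
plt.figure(figsize=[15,5])
plt.suptitle('Comparison of Clusters Characteristics for Mixed Clustering');
for idx,column in enumerate(plot_columns) : 
    # subplot takes integers (rows, columns, index); string specs such as '131' are deprecated in recent matplotlib
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='mixed_cluster_id',data=clustering_data)
# parallel coordinate plot
plt.figure(figsize=[8,8])
# plot only the clustering features plus this label; the palette keeps only colour strings matplotlib can parse
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['mixed_cluster_id']], 'mixed_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e"]);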
plt.title('Parallel Coordinate Plot for Mixed Clustering');
plt.xticks(rotation=45);
# Countries in cluster with area of interest 
condition = countries['mixed_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
Columns : 'child_mort', 'exports', 'health', 'imports', 'income','inflation', 'life_expec', 'total_fer', 'gdpp','trade_deficit' were used for clustering.
The Hopkins statistic was calculated; it showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
Because its results were more interpretable, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the results.
Characteristics of Clusters obtained :
| Cluster | GDP | Income | Child Mortality | 
|---|---|---|---|
| 0 | Very Low | Very Low | Very High | 
| 1 | Low | Low | Low | 
| 2 | Moderate | Moderate | Moderate | 
| 3 | High | High | Low | 
| 4 | Very High | Very High | Low | 
| 5 | Very Low | Very Low | Low |
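The chosen method, complete linkage with a correlation-based distance, groups observations by the shape of their feature profile rather than by magnitude. A minimal sketch on four made-up profiles (not the country data) illustrates this; the array values are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Made-up standardized profiles: rows 0-1 share one shape, rows 2-3 the opposite shape
profiles = np.array([
    [ 1.0,  0.5, -1.0, -0.5],
    [ 2.0,  1.2, -2.0, -0.9],   # similar shape to row 0, larger magnitude
    [-1.0, -0.5,  1.0,  0.5],
    [-2.0, -1.1,  2.0,  1.0],   # similar shape to row 2, larger magnitude
])

# Correlation distance = 1 - Pearson correlation between two rows
mergings = linkage(profiles, method='complete', metric='correlation')

# Cut the tree into two clusters and flatten the (n, 1) label array
labels = cut_tree(mergings, n_clusters=2).reshape(-1)
print(labels)
```

Rows 0 and 1 end up together even though row 1 is roughly twice as large, because their profiles are almost perfectly correlated.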