Jayanth Boddu
Clustering
An NGO wants to use its funding strategically to aid five countries in dire need of help. The objective is to use clustering algorithms to group countries based on socio-economic and health factors that reflect their overall development. The final deliverable is a suggestion of the 5 countries that need the aid the most.
Countries in dire need of aid are expected to show the following feature levels:
Feature | Expected Level |
---|---|
Child Mortality | High |
Life Expectancy | Low |
Fertility | High |
Health Spending | Low |
GDP per capita | Low |
Inflation Index | High |
Income per Person | Low |
Imports | High |
Exports | Low |
A new feature, 'Trade Deficit' (imports minus exports, per capita), has been derived.
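The derivation (repeated in the preprocessing code further below) is simply the difference of the converted per-capita values:
# trade deficit per capita = imports per capita - exports per capita
countries['trade_deficit'] = countries['imports'] - countries['exports']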
Univariate analysis revealed the following:
Feature | Highest | Lowest |
---|---|---|
Child Mortality | Haiti, Sierra Leone | Iceland |
Life Expectancy | Japan, Singapore | Haiti, Lesotho |
Total Fertility | Chad, Niger | Singapore, South Korea |
Health Spending | Switzerland, US | Madagascar, Eritrea |
GDP per capita | Luxembourg, Norway | Burundi, Liberia |
Inflation Index | Nigeria, Venezuela | Seychelles, Ireland |
Net Income per person | Qatar, Luxembourg | Liberia, Congo Dem. Rep. |
Imports per capita | Luxembourg, Singapore | Myanmar, Burundi |
Exports per capita | Luxembourg, Singapore | Myanmar, Burundi |
Trade Deficit per capita | Bahamas, Greece | Luxembourg, Qatar |
Outliers were treated as follows:
Feature | Upper Outliers | Lower Outliers |
---|---|---|
Child Mortality | Not Changed | Capped |
Life Expectancy | Capped | Not Changed |
Fertility | Not Changed | Capped |
Health Spending | Capped | Not Changed |
GDP per capita | Capped | Not Changed |
Inflation Index | Not Changed | Capped |
Income per Person | Capped | Not Changed |
Imports | Capped | Not Changed |
Exports | Capped | Not Changed |
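Capping was applied at the outlier thresholds identified during univariate analysis (the 1st and 99th percentiles). A minimal illustrative sketch, using lower outliers in `child_mort` as an example (the actual treatment appears in the outlier-analysis code below):
# cap values below the 1st percentile at the 1st percentile value
lower_cap = countries['child_mort'].quantile(0.01)
countries.loc[countries['child_mort'] < lower_cap, 'child_mort'] = lower_cap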
The columns `'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit'` were used for clustering.
The Hopkins statistic was calculated and showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
Characteristics of the clusters obtained (hierarchical clustering, complete linkage, correlation distance):
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import warnings
warnings.filterwarnings('ignore')
!pip install tabulate
!jt -f roboto -fs 12 -cellw 100%
from tabulate import tabulate
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure,show,output_notebook,reset_output
from bokeh.transform import factor_cmap
from bokeh.layouts import row
output_notebook()
# to table print a dataframe
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt='psql'))
countries = pd.read_csv('./Country-data.csv')
tab(countries.head())
`exports`, `imports` and `health` are given as a percentage of GDP per capita; converting them to actual per-capita values.
# conversion to actual values
countries['exports'] = 0.01 * countries['exports'] * countries['gdpp']
countries['imports'] = 0.01 * countries['imports'] * countries['gdpp']
countries['health'] = 0.01 * countries['health'] * countries['gdpp']
countries.info()
# taking a closer look to see if any country is duplicated
countries[countries.duplicated(subset=['country'])].index.values
# exports, health and imports were given as a percentage of gdpp and have been converted to actual values. Checking for anomalies where they exceed gdpp
condition1 = countries['health'] < countries['gdpp']
condition2 = countries['imports'] < countries['gdpp']
condition3 = countries['exports'] < countries['gdpp']
# countries which don't satisfy the above conditions
index = countries[~(condition1 & condition2 & condition3)].index.values
countries.loc[index]
# child mortality rate is per 1000 live births. Let's check if it's above 1000
countries[countries['child_mort'] > 1000].index.values
# summary statistics
tab(countries.describe())
# let's look at the upper quantiles to take a closer look at outliers
tab(countries.quantile(np.linspace(0.75, 1, 25), numeric_only=True).reset_index())
`exports`, `child_mort`, `imports`, `income`, `inflation`, `life_expec`, `total_fer` and `gdpp` have outliers.
# function to perform outlier analysis
def outlier_analysis(column):
    '''
    Prints a violin plot and box plot of the column provided.
    Also prints the major quantiles, the lower and upper outlier threshold values,
    and tables of countries which are outliers.
    Input : column name
    Output : lower outlier threshold condition, upper outlier threshold condition
    Side effects : violin plot, box plot, outlier tables
    '''
    plt.figure(figsize=[12, 6])
    plt.subplot(121)
    plt.title('Violin Plot of ' + column)
    sns.violinplot(countries[column])
    plt.subplot(122)
    plt.title('Box Plot of ' + column)
    sns.boxplot(countries[column])
    print('Quantiles\n')
    tab(countries[column].quantile([0.1, 0.25, 0.50, 0.75, 0.99]))
    lower_outlier_threshold = countries[column].quantile(0.01)
    upper_outlier_threshold = countries[column].quantile(0.99)
    print('\n\nLOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR', column, ': ', lower_outlier_threshold)
    l_condition = countries[column] < lower_outlier_threshold
    l_outliers = countries[l_condition][['country', column]].sort_values(by=column)
    if l_outliers.shape[0]:
        print('\n\nLower Outliers : ')
        tab(l_outliers)
    else:
        print('No lower outliers found in ' + column)
    print('\n\nUPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR', column, ': ', upper_outlier_threshold)
    u_condition = countries[column] > upper_outlier_threshold
    u_outliers = countries[u_condition][['country', column]].sort_values(by=column)
    if u_outliers.shape[0]:
        print('\n\nUpper Outliers : ')
        tab(u_outliers)
    print('\n\n')
    return l_condition, u_condition
# Countries with outliers in child mortality
column = 'child_mort'
l_condition, u_condition = outlier_analysis(column)
# Capping lower outliers in `child_mort` at the 1st percentile
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# LIFE EXPECTANCY
column = 'life_expec'
l_condition, u_condition = outlier_analysis(column)
# Capping upper outliers in life expectancy
countries.loc[u_condition, column] = countries[column].quantile(0.99)
column = 'total_fer'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in fertility at the 1st percentile threshold
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# health spending per capita
column = 'health'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in health spending
countries.loc[u_condition,column] = countries[column].quantile(0.99)
# GDP per capita
column = 'gdpp'
l_condition, u_condition = outlier_analysis(column)
# capping upper outliers in gdpp
countries.loc[u_condition, column] = countries[column].quantile(.99)
column = 'inflation'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers in inflation
countries.loc[l_condition, column] = countries[column].quantile(0.01)
# income
column = 'income'
l_condition, u_condition = outlier_analysis(column)
column = 'imports'
l_condition, u_condition = outlier_analysis(column)
column = 'exports'
l_condition, u_condition = outlier_analysis(column)
# trade deficit
countries['trade_deficit'] = countries['imports'] - countries['exports']
column = 'trade_deficit'
l_condition, u_condition = outlier_analysis(column)
# capping lower outliers of trade_deficit
countries.loc[l_condition,column] = countries[column].quantile(0.01)
# Pair Plots of all variables
sns.pairplot(countries[['child_mort', 'exports', 'health', 'imports', 'income',
'inflation', 'life_expec', 'total_fer', 'gdpp']]);
## bivariate analysis bokeh function
def bivariate_analysis(x_var, y_var, dataframe=countries):
    # Bivariate scatter plot with tooltips : country, x, y
    dataframe = dataframe.copy()
    source = ColumnDataSource(dataframe)
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']
    # tooltips
    tooltips1 = [
        ("Country", "@country"),
        (x_var, '@' + x_var),
        (y_var, '@' + y_var)
    ]
    p = figure(plot_width=420, plot_height=400, title='Scatter Plot : ' + x_var + ' vs ' + y_var, tooltips=tooltips1)
    p.scatter(x=x_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color="#3498db")
    p.xaxis.axis_label = x_var
    p.yaxis.axis_label = y_var
    reset_output()
    output_notebook()
    show(p)
# CHILD MORTALITY vs LIFE EXPECTANCY
x='child_mort'
y = 'life_expec'
bivariate_analysis(x,y)
`child_mort` and `life_expec` are almost linearly (inversely) related.
# Countries with high child mortality and low life expectancy
life_expect_cond = countries['life_expec'] < 60
child_mort_cond = countries['child_mort'] > 100
print('Countries with low life expectancy and high Child Mortality Rate')
tab(countries[life_expect_cond & child_mort_cond][['country','child_mort','life_expec']].sort_values(by=['child_mort','life_expec'], ascending=[False,True])[:10])
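A quick numeric check of the near-linear inverse relationship noted above:
# correlation between child mortality and life expectancy
tab(countries[['child_mort', 'life_expec']].corr())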
# CHILD MORTALITY vs TOTAL FERTILITY
x='child_mort'
y = 'total_fer'
bivariate_analysis(x,y)
# countries with high fertility and child mortality rate
total_fer_cond = countries['total_fer'] > 6
child_mort_cond = countries['child_mort'] > 100
print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[total_fer_cond & child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# Fertility in countries with extremely high mortality rate
child_mort_cond = countries['child_mort'] > 150
print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))
# GDP per capita vs Health Spending
y='health'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with Low GDPP and health spending might need aid
health_cond = countries['health'] < 200
gdpp_cond = countries['gdpp'] < 2500
print('Countries with Low GDP and low Health Spending')
tab(countries[health_cond & gdpp_cond][['country','health','gdpp']].sort_values(by=['health','gdpp'], ascending=[True,True])[:10])
# IMPORTS vs GDPP
y='imports'
x= 'gdpp'
bivariate_analysis(x,y)
# Exports vs GDPP
y='exports'
x= 'gdpp'
bivariate_analysis(x,y)
y='trade_deficit'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with positive high trade deficit and low GDP per capita
tr_deficit_cond = countries['trade_deficit'] > 1000
gdpp_cond = countries['gdpp'] < 10000
print('Countries with high Trade Deficit and Low GDP')
tab(countries[tr_deficit_cond & gdpp_cond][['country','gdpp','trade_deficit']].sort_values(by=['trade_deficit','gdpp'], ascending=[True,False])[:10])
# Income vs Inflation
x = 'inflation'
y = 'income'
bivariate_analysis(x,y)
# Low Income - High inflation countries might require aid
inflation_cond = countries['inflation'] > 20
income_condition = countries['income'] < 10000
print('Countries with high inflation and low income')
tab(countries[inflation_cond & income_condition][['country','inflation','income']].sort_values(by=['inflation','income'], ascending=[False, True]))
x = 'gdpp'
y = 'inflation'
bivariate_analysis(x,y)
plt.figure(figsize=[12,12])
sns.heatmap(countries.corr(numeric_only=True), annot=True, cmap='YlGnBu', center=0)
Top correlations: `life_expec` and `child_mort`; `total_fer` and `child_mort`.
Although cluster analysis is not affected by multicollinearity, this plot shows the possible linear relationships between features, which helps in interpreting the results of the cluster analysis.
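One way to list the strongest pairwise correlations directly, as a minimal sketch (`numeric_only=True` simply excludes the non-numeric `country` column):
# rank feature pairs by absolute correlation, keeping each pair only once
corr = countries.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
tab(top_pairs.head(5))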
# hopkins test function
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan
def hopkins(X):
    d = X.shape[1]    # number of dimensions (columns)
    n = len(X)        # number of observations (rows)
    m = int(0.1 * n)  # sample size
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
    rand_X = sample(range(0, n, 1), m)
    ujd = []
    wjd = []
    for j in range(0, m):
        # nearest-neighbour distance from a point drawn uniformly over the data's range
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # nearest-neighbour distance from a randomly chosen data point
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
    return H
## Data used for Clustering
columns_for_clustering = ['child_mort', 'exports', 'health', 'imports', 'income',
'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit']
clustering_data = countries[columns_for_clustering].copy()
# Hopkin's test
n = 10
hopkins_statistic = []
for i in range(n):
    hopkins_statistic.append(hopkins(clustering_data))
print('Min Hopkin\'s Statistic in ',n,'iterations :', min(hopkins_statistic))
print('Max Hopkin\'s Statistic in ',n,'iterations :', max(hopkins_statistic))
print('Mean Hopkin\'s Statistic in ',n,'iterations :', np.mean(hopkins_statistic))
print('Std deviation of Hopkin\'s Statistic in ',n,'iterations :', np.std(hopkins_statistic))
Since the Hopkins statistic is greater than 0.8, the data shows a strong clustering tendency.
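For reference, the value computed by the function above is the standard Hopkins statistic, H = sum(ujd) / (sum(ujd) + sum(wjd)), where `ujd` holds nearest-neighbour distances measured from points sampled uniformly over the data's range and `wjd` holds nearest-neighbour distances measured from randomly chosen data points. Values around 0.5 indicate randomly distributed data, while values close to 1 indicate a strong clustering tendency.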
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
clustering_data[columns_for_clustering] = scaler.fit_transform(clustering_data[columns_for_clustering])
tab(clustering_data.describe())
# Plotting Elbow curve of Sum of Squared distances of points in each cluster from the centroid of the nearest cluster.
ssd = []
range_n_clusters = np.arange(2,9)
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(clustering_data)
    ssd.append(kmeans.inertia_)
plt.plot(range_n_clusters,ssd)
plt.title('Elbow Curve');
plt.xlabel('No of clusters');
plt.ylabel('SSD');
from sklearn.metrics import silhouette_score
no_of_clusters = np.arange(2,10)
score = []
for n_cluster in no_of_clusters:
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    kmeans = kmeans.fit(clustering_data)
    labels = kmeans.labels_
    score.append(silhouette_score(clustering_data, labels))
plt.title('Silhouette Analysis Plot')
plt.xlabel('No of Clusters')
plt.ylabel('Silhouette Score')
plt.plot(no_of_clusters, score);
print(score)
# k-means clustering with k = 4 (chosen from the elbow curve and silhouette analysis above)
n_cluster = 4
kmeans = KMeans(n_clusters=n_cluster, init='k-means++', random_state = 100)
kmeans = kmeans.fit(clustering_data)
labels = kmeans.labels_
countries['k_means_cluster_id'] = labels
# Countries in each Cluster - k means
for cluster_no in range(n_cluster):
    condition = countries['k_means_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Agglomerative Single Linkage
mergings = linkage(clustering_data,method='single',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Single Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Complete Linkage
mergings = linkage(clustering_data,method='complete',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Complete Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Complete Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_complete_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - Complete Linkage
for cluster_no in range(n_clusters):
    condition = countries['hac_complete_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Average Linkage
mergings = linkage(clustering_data,method='average',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Average Linkage - Hierarchical Clustering')
dendrogram(mergings);
# Using Average Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
countries['hac_average_cluster_id'] = cluster_labels
# Countries in each Cluster - Hierarchical - average Linkage
for cluster_no in range(n_clusters):
    condition = countries['hac_average_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
## HAC Clustering : Dissimilarity Measure : Correlation
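For reference, scipy's `'correlation'` metric measures the dissimilarity between two countries' (scaled) feature vectors u and v as d(u, v) = 1 - corr(u, v), so countries whose feature profiles move together are considered close even if their magnitudes differ. A minimal sketch of the underlying pairwise distance matrix (illustrative only; the clustering itself needs only the `linkage` call below):
from scipy.spatial.distance import pdist, squareform
# pairwise correlation distances between countries (rows of the scaled clustering data)
correlation_distances = squareform(pdist(clustering_data[columns_for_clustering], metric='correlation'))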
hac_correlation_mergings = linkage(clustering_data,method='complete', metric='correlation')
plt.figure(figsize=[12,12])
plt.title('Hierarchical Clustering : Complete Linkage, Correlation Measure')
dendrogram(hac_correlation_mergings);
n_clusters = 6
labels = cut_tree(hac_correlation_mergings, n_clusters=n_clusters)
countries['hac_correlation_cluster_id'] = labels
# HAC clustering : Correlation measure : complete distance : Countries in each cluster
for cluster_no in range(n_clusters):
    condition = countries['hac_correlation_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# Performing k-means using the results of hierarchical clustering:
# 1. Number of clusters taken from hierarchical clustering
# 2. Centroids obtained from hierarchical clustering used as the initialization points
clustering_data['k_means_cluster_id'] = countries['k_means_cluster_id']
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
columns = columns_for_clustering.copy()
columns.extend(['hac_correlation_cluster_id'])
centroids = clustering_data[columns].groupby(['hac_correlation_cluster_id']).mean()
n_clusters = 6
mixed_kmeans = KMeans(n_clusters=n_clusters, init=centroids.values, n_init=1, random_state=100)
results = mixed_kmeans.fit(clustering_data[columns_for_clustering])
countries['mixed_cluster_id'] = results.labels_
# Mixed clustering : Euclidean measure : k-means : Countries in each cluster
for cluster_no in range(n_clusters):
    condition = countries['mixed_cluster_id'] == cluster_no
    print('CLUSTER #', cluster_no, '\n', countries.loc[condition, 'country'].values, '\n\n')
# silhouette scores of all the methods
print('Mixed Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['mixed_cluster_id']))
print('K-means Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['k_means_cluster_id']))
print('Hierarchical Correlation Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['hac_correlation_cluster_id']))
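For easier comparison, the same three scores can be collected into one small table (a minimal sketch):
# silhouette scores of the three approaches side by side
silhouette_comparison = pd.DataFrame({
    'method': ['Mixed Clustering', 'K-means Clustering', 'Hierarchical Correlation Clustering'],
    'silhouette_score': [
        silhouette_score(clustering_data[columns_for_clustering], countries['mixed_cluster_id']),
        silhouette_score(clustering_data[columns_for_clustering], countries['k_means_cluster_id']),
        silhouette_score(clustering_data[columns_for_clustering], countries['hac_correlation_cluster_id']),
    ],
})
tab(silhouette_comparison)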
# Cluster Profiling - Plots using Bokeh
cluster_id_column = 'hierarchical-c-link-cluster-id'
title="Hierarchical Clustering"
def cluster_analysis_plot(cluster_id_column, title, x_var='income', y_var='child_mort', z_var='gdpp', dataframe=countries):
    # Three scatter plots per clustering result
    # works up to 6 clusters (limited by the palette)
    dataframe = dataframe.copy()
    dataframe[cluster_id_column] = dataframe[cluster_id_column].astype('str')
    source = ColumnDataSource(dataframe)
    cluster_ids = sorted(dataframe[cluster_id_column].unique())
    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']
    mapper = factor_cmap(cluster_id_column, palette=pallete[:len(cluster_ids)], factors=cluster_ids)
    # plot 1 : z_var vs x_var
    tooltips1 = [
        ("Country", "@country"),
        (z_var, '@' + z_var),
        (x_var, '@' + x_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    p = figure(plot_width=420, plot_height=400, title=title + ' : ' + z_var + ' vs ' + x_var, tooltips=tooltips1, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        p.scatter(x=z_var, y=x_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], muted_alpha=0.1, legend_label=index)
    p.xaxis.axis_label = z_var
    p.yaxis.axis_label = x_var
    p.legend.click_policy = "mute"
    # ----------------------------
    # plot 2 : z_var vs y_var
    tooltips2 = [
        ("Country", "@country"),
        (z_var, '@' + z_var),
        (y_var, '@' + y_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    q = figure(plot_width=420, plot_height=400, title=title + ' : ' + z_var + ' vs ' + y_var, tooltips=tooltips2, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        q.scatter(x=z_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], muted_alpha=0.1, legend_label=index)
    q.xaxis.axis_label = z_var
    q.yaxis.axis_label = y_var
    q.legend.click_policy = "mute"
    # ----------------------------
    # plot 3 : x_var vs y_var
    tooltips3 = [
        ("Country", "@country"),
        (x_var, '@' + x_var),
        (y_var, '@' + y_var),
        ('Cluster', '@' + cluster_id_column)
    ]
    r = figure(plot_width=420, plot_height=400, title=title + ' : ' + x_var + ' vs ' + y_var, tooltips=tooltips3, toolbar_location=None)
    for num, index in enumerate(cluster_ids):
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        r.scatter(x=x_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color=pallete[num], legend_label=index, muted_alpha=0.1)
    r.xaxis.axis_label = x_var
    r.yaxis.axis_label = y_var
    r.legend.click_policy = "mute"
    show(row(p, q, r))
# Cluster Profiling for k-means with 4 clusters
# hover for country names and x, y values , cluster no
# Click on legend to selectively view clusters
cluster_analysis_plot('k_means_cluster_id','k-means clusters')
# Comparing k-means Clusters using mean values of features
clustering_data[['child_mort','income','gdpp','k_means_cluster_id']].groupby('k_means_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for K-means results');
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for K-means')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='k_means_cluster_id', data=clustering_data)
plt.figure(figsize=[8,8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['k_means_cluster_id']], 'k_means_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]);
plt.title('Parallel Coordinate Plot for K-means')
plt.xticks(rotation=45);
From the above plots, each cluster's features can be characterized at the following levels.
Characteristics of each cluster:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | High | High to Very High | Low |
1 | Low | Low | High to Very High |
2 | Low to Moderate | Low to Moderate | Low to Moderate |
3 | Very High | Very High | Low |
From these characteristics, Cluster 1 is our area of interest. Let's look at the countries in Cluster 1.
# Countries in cluster with area of interest
condition = countries['k_means_cluster_id'] == 1
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
# HAC : Complete linkage, Correlation based distance : cluster analysis plot
# hover for country names and x, y values , cluster no
# Click on legend to selectively view clusters
cluster_analysis_plot('hac_correlation_cluster_id','HAC CORRELATION CLUSTERS')
# Comparing Hierarchical Clusters using mean values of features
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
clustering_data[['child_mort','income','gdpp','hac_correlation_cluster_id']].groupby('hac_correlation_cluster_id').mean().plot(kind='barh');
plt.title('Comparison of Cluster Means for Hierarchical Clustering');
# box plots
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for Hierarchical Clustering')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='hac_correlation_cluster_id', data=clustering_data)
# Parallel Coordinates plot for hierarchical clustering with correlation measure and complete linkage
plt.figure(figsize=[8, 8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['hac_correlation_cluster_id']], 'hac_correlation_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']);
plt.title('Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage')
plt.xticks(rotation=45);
From the above plots, each cluster's features can be characterized at the following levels.
Characteristics of each cluster:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
From these characteristics, Cluster 0 is our area of interest. Let's look at the countries in Cluster 0.
# Countries in cluster with area of interest
condition = countries['hac_correlation_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
cluster_analysis_plot('mixed_cluster_id','MIXED K-MEANS CLUSTERS')
# Comparing MIXED k-means Clusters using mean values of features
clustering_data['mixed_cluster_id'] = countries['mixed_cluster_id']
clustering_data[['child_mort','income','gdpp','mixed_cluster_id']].groupby('mixed_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of cluster means for Mixed Clustering');
# box plots
plot_columns = ['child_mort', 'income', 'gdpp']
plt.figure(figsize=[15, 5])
plt.suptitle('Comparison of Clusters Characteristics for Mixed Clustering')
for idx, column in enumerate(plot_columns):
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='mixed_cluster_id', data=clustering_data)
# parallel coordinate plot
plt.figure(figsize=[8,8])
pd.plotting.parallel_coordinates(clustering_data[columns_for_clustering + ['mixed_cluster_id']], 'mixed_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']);
plt.title('Parallel Coordinate Plot for Mixed Clustering');
plt.xticks(rotation=45);
# Countries in cluster with area of interest
condition = countries['mixed_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
The columns `'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit'` were used for clustering.
The Hopkins statistic was calculated and showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
Because it produced more interpretable results, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the final results.
Characteristics of the clusters obtained:
Cluster | GDP per capita | Income | Child Mortality |
---|---|---|---|
0 | Very Low | Very Low | Very High |
1 | Low | Low | Low |
2 | Moderate | Moderate | Moderate |
3 | High | High | Low |
4 | Very High | Very High | Low |
5 | Very Low | Very Low | Low |
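For reference, the final shortlist follows directly from the chosen clustering: the five countries from the very-low-GDP, very-low-income, very-high-child-mortality cluster (Cluster 0), ranked by highest child mortality and then by lowest income and GDP per capita (repeating the selection step shown earlier):
# final recommendation : top 5 countries in the cluster of interest
condition = countries['hac_correlation_cluster_id'] == 0
tab(countries.loc[condition, ['country', 'child_mort', 'income', 'gdpp']]
        .sort_values(by=['child_mort', 'income', 'gdpp'], ascending=[False, True, True])[:5])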