Hello World

from __future__ import print_function  # __future__ imports must come before all other statements
import pydendroheatmap as pdh
#Sklearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans,DBSCAN
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice
import sklearn
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.cluster.hierarchy import cophenet
from scipy.cluster.hierarchy import inconsistent
from scipy.cluster.hierarchy import fcluster
import scipy.spatial as sp, scipy.cluster.hierarchy as hc


#Plot
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd
print(__doc__)
from IPython.display import HTML
path = '/media/cotran/New Volume/REU_data'
small = '/CNCAU_1403-1509_R_v1_03-03-2016/'
new_features = '/New features computing/'
folder = '/Computing_features/'
df_final = pd.read_csv(path+folder+'Courses_DataFrame.csv')
from IPython.display import Latex
#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
#Visualization
import matplotlib as mpl
import matplotlib.pylab as pylab
from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated since pandas 0.20
%matplotlib inline
plt.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 20,12
import numpy as np
from scipy.cluster.hierarchy import linkage, _LINKAGE_METHODS
from scipy.spatial.distance import pdist
from fastcluster import linkage as linkage_fc

Automatically created module for IPython interactive environment

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

df_mm = pd.read_csv('/media/cotran/New Volume/REU_data/Computing_features/Courses_minmax.csv')
df_std = pd.read_csv('/media/cotran/New Volume/REU_data/Computing_features/Courses_Standard.csv')
df_mm['course_id'] = df_final['course_id']
df_std['course_id'] = df_final['course_id']
df_mm['discipline'] = df_mm['discipline']*10
df_std['discipline'] = df_std['discipline']*10

DBSCAN, K-Means, and Hierarchical Clustering: A Comparison Notebook

In this notebook, we investigate applying different clustering techniques to our data.
Basic intuition behind the three techniques used in this notebook:
- DBSCAN: A density-based clustering algorithm. Given a set of points in space, it groups together points that are closely packed: it finds core samples in high-density regions and expands clusters from them. Points in low-density regions are treated as outliers (noise). The algorithm works well for data that contains clusters of similar density.
- K-Means: The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. This algorithm requires the number of clusters to be specified.
- Hierarchical clustering: Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.
  - In this notebook, I use agglomerative clustering, which performs hierarchical clustering with a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy. Specifically, I used the Ward linkage method:
    - Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the K-Means objective function, but tackled with an agglomerative hierarchical approach.
A guess: K-Means and HAC (hierarchical agglomerative clustering) may yield similar results, because Ward linkage and K-Means both minimize within-cluster variance.
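
The guess above can be tried out on toy data before touching the course features. The following is a minimal sketch, not the notebook's actual pipeline: the data set is synthetic (three well-separated blobs from `make_blobs`), and the `eps`, `min_samples`, and `n_clusters` values are assumptions chosen for that toy data. The helper `within_cluster_ss` is a hypothetical name introduced here to compute the variance criterion that both K-Means and Ward try to keep small.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (assumption): three tight, well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)  # -1 marks noise


def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))


# Adjusted Rand index: 1.0 means the two partitions are identical
# (up to a renaming of the cluster ids).
ari = adjusted_rand_score(kmeans_labels, ward_labels)
wcss_kmeans = within_cluster_ss(X, kmeans_labels)
wcss_ward = within_cluster_ss(X, ward_labels)

print('ARI K-Means vs Ward:', ari)
print('within-cluster SS (K-Means, Ward):', wcss_kmeans, wcss_ward)
print('DBSCAN labels found:', np.unique(dbscan_labels))
```

On data this clean the two variance-based methods should agree; on real data with outliers the agreement is usually only partial, which is what the rest of the notebook examines.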
Have a look at the dataset before clustering

df = df_mm.drop(columns=['discipline','course_id']).copy()  # feature columns only, min-max scaled
df.sample(5)

	conversation_character_count	type_attach_count	conversation_count	number_of_discussion_entries	total_discus_character_count	character_per_dis_ratio	number_of_discussion_topics	total_topic_character_count	character_topic_ratio	quiz_count	quiz_avg_score	quiz_time_taken	quiz_avg_attempts	mean_final_score	student_count	wiki_view_count	wiki_total_character_count	wiki_page_count
19	8.735570e-04	0.000000	0.002814	0.003957	0.006637	0.932851	0.004098	0.016490	0.598002	0.013846	0.357279	0.000702	0.548983	0.272334	0.017024	0.005815	0.008728	0.008808
136	2.456138e-04	0.002958	0.000429	0.076005	0.096690	0.701522	0.002849	0.003269	0.169304	0.262840	0.404514	0.000397	0.140508	0.507906	0.083210	0.006661	0.002714	0.001187
62	6.287568e-08	0.000000	0.000000	0.014766	0.009079	0.303255	0.000098	0.000139	0.125507	0.214201	0.427686	0.000411	0.025084	0.128547	0.128511	0.001860	0.001498	0.000474
36	2.269598e-04	0.054539	0.000588	0.026807	0.014766	0.264349	0.004316	0.005816	0.200433	0.024970	0.363894	0.002593	0.248685	0.268063	0.030852	0.002889	0.004029	0.003271
141	2.198240e-05	0.000000	0.000160	0.018894	0.009803	0.245196	0.006182	0.006166	0.149077	0.023787	0.325931	0.001311	0.296097	0.805937	0.007868	0.007513	0.004071	0.002185

df.describe()

	conversation_character_count	type_attach_count	conversation_count	number_of_discussion_entries	total_discus_character_count	character_per_dis_ratio	number_of_discussion_topics	total_topic_character_count	character_topic_ratio	quiz_count	quiz_avg_score	quiz_time_taken	quiz_avg_attempts	mean_final_score	student_count	wiki_view_count	wiki_total_character_count	wiki_page_count
count	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000
mean	0.008450	0.013744	0.011170	0.037981	0.027017	0.315813	0.013557	0.019027	0.237731	0.070487	0.348353	0.006953	0.188794	0.254796	0.056031	0.012569	0.017818	0.013587
std	0.075038	0.077441	0.075545	0.099784	0.084923	0.187477	0.079090	0.089729	0.132570	0.117741	0.108887	0.074888	0.199360	0.196254	0.089660	0.081549	0.082071	0.078681
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000010	0.000000	0.000056	0.007832	0.003990	0.195092	0.001148	0.001558	0.163918	0.015296	0.303122	0.000443	0.025223	0.108271	0.017095	0.001861	0.001579	0.001437
50%	0.000101	0.000000	0.000353	0.016941	0.009330	0.264930	0.002840	0.004141	0.215955	0.029231	0.360231	0.000603	0.128538	0.211987	0.036074	0.003468	0.004050	0.002846
75%	0.001539	0.003242	0.005658	0.031726	0.020568	0.375194	0.005446	0.009166	0.289130	0.072130	0.414621	0.001124	0.294303	0.354888	0.068464	0.005779	0.008982	0.005989
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

Hierarchical Clustering

Compute the cophenetic correlation for the Ward linkage method.
Use the silhouette score to choose the best cut-off distance.
Plot the distance similarity heat map.

df = df_mm.drop(columns=['discipline','course_id']).copy()
Z = linkage(df, 'ward')

The cophenetic correlation for a cluster tree is defined as the linear correlation coefficient between the cophenetic distances obtained from the tree, and the original distances (or dissimilarities) used to construct the tree. Thus, it is a measure of how faithfully the tree represents the dissimilarities among observations.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.
The output value, c, is the cophenetic correlation coefficient. The magnitude of this value should be very close to 1 for a high-quality solution.
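
To make the definition concrete, the sketch below recomputes the coefficient by hand on random toy data (an assumption, not the course features): it is the Pearson correlation between the original pairwise distances and the cophenetic distances read off the tree. Only `scipy` calls that the notebook already uses appear here; `c_manual` is a name introduced for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X = rng.rand(20, 4)                  # toy data (assumption): 20 samples, 4 features

original = pdist(X)                  # condensed vector of pairwise Euclidean distances
Z = linkage(X, 'ward')               # Ward linkage tree
c, coph = cophenet(Z, original)      # coefficient and condensed cophenetic distances

# The coefficient is the linear correlation between the two distance vectors.
c_manual = np.corrcoef(original, coph)[0, 1]

print('cophenet(): %f, by hand: %f' % (c, c_manual))
```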

c, coph_dists = cophenet(Z, pdist(df))
print('cophenet score for the ward linkage : %f' %c)

cophenet score for the ward linkage : 0.532821

Try different parameters for the best silhouette score.

for n in np.arange(0.5,3,0.1):
    max_d = n
    # Cut the Ward tree at height max_d to obtain flat cluster labels
    clusters = fcluster(Z, max_d, criterion='distance')
    # Stored as `score` so the imported silhouette_score function is not shadowed
    score = metrics.silhouette_score(df, clusters,
                                      metric='euclidean')
    print('max_distance = %f,number of clusters = %f, silhouette_score = %f' %
          (max_d,clusters.max(),score))

max_distance = 0.500000,number of clusters = 40.000000, silhouette_score = 0.191249
max_distance = 0.600000,number of clusters = 31.000000, silhouette_score = 0.192446
max_distance = 0.700000,number of clusters = 25.000000, silhouette_score = 0.183386
max_distance = 0.800000,number of clusters = 25.000000, silhouette_score = 0.183386
max_distance = 0.900000,number of clusters = 21.000000, silhouette_score = 0.161730
max_distance = 1.000000,number of clusters = 18.000000, silhouette_score = 0.150428
max_distance = 1.100000,number of clusters = 14.000000, silhouette_score = 0.188010
max_distance = 1.200000,number of clusters = 12.000000, silhouette_score = 0.182205
max_distance = 1.300000,number of clusters = 11.000000, silhouette_score = 0.181101
max_distance = 1.400000,number of clusters = 10.000000, silhouette_score = 0.260764
max_distance = 1.500000,number of clusters = 8.000000, silhouette_score = 0.257966
max_distance = 1.600000,number of clusters = 8.000000, silhouette_score = 0.257966
max_distance = 1.700000,number of clusters = 7.000000, silhouette_score = 0.254012
max_distance = 1.800000,number of clusters = 7.000000, silhouette_score = 0.254012
max_distance = 1.900000,number of clusters = 7.000000, silhouette_score = 0.254012
max_distance = 2.000000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.100000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.200000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.300000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.400000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.500000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.600000,number of clusters = 6.000000, silhouette_score = 0.252404
max_distance = 2.700000,number of clusters = 4.000000, silhouette_score = 0.196229
max_distance = 2.800000,number of clusters = 3.000000, silhouette_score = 0.194805
max_distance = 2.900000,number of clusters = 3.000000, silhouette_score = 0.194805
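
The scan above can also be wrapped in a small helper that returns the best threshold directly. This is a sketch under assumptions: `best_cutoff` is a hypothetical helper name, the demo data are synthetic blobs rather than the course features, and thresholds that produce fewer than two clusters (or one cluster per sample) are skipped because the silhouette score is undefined there.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def best_cutoff(X, thresholds, method='ward'):
    """Return (threshold, n_clusters, score) with the highest silhouette score."""
    Z = linkage(X, method)
    best = None
    for t in thresholds:
        labels = fcluster(Z, t, criterion='distance')
        n_clusters = len(np.unique(labels))
        # The silhouette score needs 2 <= n_clusters <= n_samples - 1.
        if n_clusters < 2 or n_clusters >= len(X):
            continue
        score = silhouette_score(X, labels, metric='euclidean')
        if best is None or score > best[2]:
            best = (t, n_clusters, score)
    return best


# Toy data (assumption): three tight blobs of 30 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

best = best_cutoff(X, np.arange(0.5, 30, 0.5))
print('threshold = %.1f, clusters = %d, silhouette = %.3f' % best)
```

Ties are resolved in favour of the smallest threshold, since a later threshold only replaces the current best when its score is strictly higher.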

A cut-off distance of 1.4 yields the best silhouette score (0.260764, with 10 clusters); 1.5 comes close (0.257966, with 8 clusters).
Let's investigate the 1.4 cut-off distance.

max_d = 1.4
clusters = fcluster(Z, max_d, criterion='distance')
df_ward = df.copy()
df_ward['cluster'] = clusters
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,9
g = sns.countplot(x=df_ward['cluster'])  # pass the column as x explicitly; newer seaborn treats a positional argument as `data`
g.set_title('Number of courses per cluster using Hierarchical Clustering')
df_ward['cluster'].value_counts()

   90
   31
   29
   13
    5
    4
    3
   1
    1
    1
Name: cluster, dtype: int64

[Figure: count plot titled 'Number of courses per cluster using Hierarchical Clustering']

There are 10 clusters, ranging in size from 1 to 90 courses.
If we keep only the clusters with more than 10 courses, the result may be better.
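
Keeping only the larger clusters can be sketched as follows. The label vector here is hypothetical and much smaller than the real one, and the threshold of 10 courses is taken from the sentence above; `sizes`, `large`, and `mask` are names introduced for illustration.

```python
import pandas as pd

# Hypothetical cluster labels (assumption): two big clusters, two small ones.
labels = pd.Series([1] * 12 + [2] * 11 + [3] * 3 + [4] * 1, name='cluster')

sizes = labels.value_counts()      # number of courses per cluster
large = sizes[sizes > 10].index    # ids of clusters with more than 10 courses
mask = labels.isin(large)          # True for courses in a large cluster
kept = labels[mask]

print(sizes)
print('clusters kept:', sorted(large))
print('courses kept: %d of %d' % (mask.sum(), len(labels)))
```

The same boolean mask can be applied to the feature frame, so that a score such as the silhouette is recomputed on the remaining courses only.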

DF_corr = df.T.corr()   # course-to-course (row-wise) correlation
DF_dism = 1 - DF_corr   # distance matrix
# Stored as row_linkage so the linkage() function imported from scipy is not overwritten
row_linkage = hc.linkage(sp.distance.squareform(DF_dism), method='ward')
dis_heat_map = sns.clustermap(DF_dism, row_linkage=row_linkage, col_linkage=row_linkage)
dis_heat_map.fig.suptitle('Distance Similarity heat map using Hierarchical Clustering with Ward linkage method')

Text(0.5,0.98,'Distance Similarity heat map using Hierarchical Clustering with Ward linkage method')

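[Figure: clustered heat map titled 'Distance Similarity heat map using Hierarchical Clustering with Ward linkage method']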

Heat map for the 1.4 cut off.

df_ward = df_ward.sort_values(by=['cluster'])
# Note: df_ward still contains the 'cluster' column, so the labels also enter this correlation
DF_corr = df_ward.T.corr()
DF_dism = 1 - DF_corr 
ax = sns.heatmap(DF_dism)
ax.set_title('HAC with 1.4 cut off line')

Text(0.5,1,'HAC with 1.4 cut off line')

[Figure: heat map titled 'HAC with 1.4 cut off line']

K-Means clustering

First, choose the best number of clusters based on the silhouette score.
Plot the distance similarity heat map.

for n in np.arange(3,20,1):
    # No random_state is set, so the scores can vary slightly between runs
    estimator = KMeans(init='k-means++', n_clusters=n)
    estimator.fit(df)
    # Stored as `score` so the imported silhouette_score function is not shadowed
    score = metrics.silhouette_score(df, estimator.labels_,
                                      metric='euclidean')
    print('number of clusters = %f, silhouette_score = %f' %(n,score))

number of clusters = 3.000000, silhouette_score = 0.205935
number of clusters = 4.000000, silhouette_score = 0.213508
number of clusters = 5.000000, silhouette_score = 0.234389
number of clusters = 6.000000, silhouette_score = 0.260838
number of clusters = 7.000000, silhouette_score = 0.268992
number of clusters = 8.000000, silhouette_score = 0.220120
number of clusters = 9.000000, silhouette_score = 0.218472
number of clusters = 10.000000, silhouette_score = 0.208612
number of clusters = 11.000000, silhouette_score = 0.219005
number of clusters = 12.000000, silhouette_score = 0.203963
number of clusters = 13.000000, silhouette_score = 0.193386
number of clusters = 14.000000, silhouette_score = 0.156634
number of clusters = 15.000000, silhouette_score = 0.174905
number of clusters = 16.000000, silhouette_score = 0.173752
number of clusters = 17.000000, silhouette_score = 0.168252
number of clusters = 18.000000, silhouette_score = 0.183504
number of clusters = 19.000000, silhouette_score = 0.178497

K-Means with 7 clusters has the highest silhouette score (0.268992), slightly above the best result of the Ward linkage method (0.260764).

estimator = KMeans(init='k-means++', n_clusters=7)
estimator.fit(df)
y_predict = estimator.predict(df)
df_k_mean = df.copy()
df_k_mean['cluster'] = y_predict
df_k_mean = df_k_mean.sort_values(by=['cluster'])
# Note: df_k_mean still contains the 'cluster' column, so the labels also enter this correlation
df_corr = df_k_mean.T.corr()
df_dism = 1 - df_corr   # distance matrix
ax = sns.heatmap(df_dism)
ax.set_title('K Means clustering with 7 clusters')

Text(0.5,1,'K Means clustering with 7 clusters')

[Figure: heat map titled 'K Means clustering with 7 clusters']

%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,9
g = sns.countplot(x=df_k_mean['cluster'])  # pass the column as x explicitly; newer seaborn treats a positional argument as `data`
df_k_mean['cluster'].value_counts()

  86
  37
  36
  13
   4
   1
   1
Name: cluster, dtype: int64

[Figure: count plot of courses per K-Means cluster]

Notice that:
- The number of courses per cluster ranges from 1 to 86.
- The clusters are well separated.
- The largest cluster has a similar number of courses (86) to the largest cluster from hierarchical clustering (90).
- If we cut off the small clusters, the result may be the same as for hierarchical clustering.
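
Whether two clusterings really are "the same" can be checked numerically rather than by eye. The sketch below uses two short, hypothetical label vectors (not the course results) and compares them with the adjusted Rand index, which ignores how the cluster ids are numbered, plus a contingency table from `pd.crosstab`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels (assumption): the same partition with different cluster ids.
hac_labels = pd.Series([1, 1, 1, 2, 2, 2, 3, 3], name='hac')
kmeans_labels = pd.Series([0, 0, 0, 2, 2, 2, 1, 1], name='kmeans')

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(hac_labels, kmeans_labels)

# Rows are HAC clusters, columns are K-Means clusters, cells are course counts.
table = pd.crosstab(hac_labels, kmeans_labels)

print('adjusted Rand index:', ari)
print(table)
```

In the contingency table, a perfect match shows exactly one non-zero cell per row and per column.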

DBSCAN

We choose the best parameters based on the silhouette score:
- eps (The maximum distance between two samples for them to be considered as in the same neighborhood)
- min_samples (The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.)
Plot the heat map.

for m in range(3,6,1):
    for n in np.arange(0.1,0.5,0.05):
        min_samples = m
        estimator = DBSCAN(eps=n, min_samples=min_samples)
        y_predict = estimator.fit_predict(df)
        # Stored as `score` so the imported silhouette_score function is not shadowed
        score = metrics.silhouette_score(df, estimator.labels_,
                                          metric='euclidean')
        # np.unique also counts the noise label -1, so the printed number of
        # clusters includes the noise group whenever noise points exist
        print('eps = %f, silhouette_score = %f, number of clusters = %d,min_samples = %d' 
              %(n,score,len(np.unique(y_predict)),min_samples))

eps = 0.100000, silhouette_score = -0.218592, number of clusters = 3,min_samples = 3
eps = 0.150000, silhouette_score = -0.166690, number of clusters = 5,min_samples = 3
eps = 0.200000, silhouette_score = 0.182986, number of clusters = 3,min_samples = 3
eps = 0.250000, silhouette_score = 0.365152, number of clusters = 2,min_samples = 3
eps = 0.300000, silhouette_score = 0.423102, number of clusters = 2,min_samples = 3
eps = 0.350000, silhouette_score = 0.447727, number of clusters = 2,min_samples = 3
eps = 0.400000, silhouette_score = 0.473987, number of clusters = 2,min_samples = 3
eps = 0.450000, silhouette_score = 0.508696, number of clusters = 2,min_samples = 3
eps = 0.100000, silhouette_score = -0.234552, number of clusters = 4,min_samples = 4
eps = 0.150000, silhouette_score = -0.187809, number of clusters = 5,min_samples = 4
eps = 0.200000, silhouette_score = 0.134944, number of clusters = 3,min_samples = 4
eps = 0.250000, silhouette_score = 0.357429, number of clusters = 2,min_samples = 4
eps = 0.300000, silhouette_score = 0.423102, number of clusters = 2,min_samples = 4
eps = 0.350000, silhouette_score = 0.447727, number of clusters = 2,min_samples = 4
eps = 0.400000, silhouette_score = 0.473987, number of clusters = 2,min_samples = 4
eps = 0.450000, silhouette_score = 0.508696, number of clusters = 2,min_samples = 4
eps = 0.100000, silhouette_score = -0.189385, number of clusters = 2,min_samples = 5
eps = 0.150000, silhouette_score = -0.159177, number of clusters = 4,min_samples = 5
eps = 0.200000, silhouette_score = 0.116643, number of clusters = 3,min_samples = 5
eps = 0.250000, silhouette_score = 0.350799, number of clusters = 2,min_samples = 5
eps = 0.300000, silhouette_score = 0.423102, number of clusters = 2,min_samples = 5
eps = 0.350000, silhouette_score = 0.447727, number of clusters = 2,min_samples = 5
eps = 0.400000, silhouette_score = 0.473987, number of clusters = 2,min_samples = 5
eps = 0.450000, silhouette_score = 0.508696, number of clusters = 2,min_samples = 5

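Many parameter settings give a high silhouette score, but only with 2 distinct labels. Note that the printed number of clusters counts distinct labels, so it includes the noise label -1 whenever noise points are present. We therefore investigate two parameter pairs that produce more labels: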
- eps = 0.200000, silhouette_score = 0.182986, number of clusters = 3,min_samples = 3
- eps = 0.150000, silhouette_score = -0.159177, number of clusters = 4,min_samples = 5
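
Because DBSCAN marks noise with the label -1, counting distinct labels overstates the number of clusters by one whenever noise is present. The sketch below shows the difference on a tiny hand-made data set (an assumption, as are the `eps` and `min_samples` values chosen for it), and also computes the silhouette score on the non-noise points only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Toy data (assumption): two tight groups of four points and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
              [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

n_labels = len(np.unique(labels))        # distinct labels, noise label -1 included
n_clusters = len(set(labels) - {-1})     # real clusters only
n_noise = int((labels == -1).sum())      # points flagged as noise

# Silhouette score restricted to the points that belong to a cluster.
clustered = labels != -1
score_clustered = silhouette_score(X[clustered], labels[clustered])

print('labels: %d, clusters: %d, noise points: %d' % (n_labels, n_clusters, n_noise))
print('silhouette on clustered points only: %f' % score_clustered)
```

Scoring only the clustered points answers a different question than scoring all points, so the two numbers should not be compared directly with the K-Means or Ward scores without saying which one is used.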

estimator = DBSCAN(eps=0.2, min_samples=3)
y_predict = estimator.fit_predict(df)
df_db = df.copy()
df_db['cluster'] = y_predict
df_db = df_db.sort_values(by=['cluster'])
# Note: df_db still contains the 'cluster' column, so the labels also enter this correlation
df_db_corr = df_db.T.corr()
df_db_dism = 1 - df_db_corr   # distance matrix
sns.set(font="monospace")
ax = sns.heatmap(df_db_dism)
ax.set_title('DBSCAN with eps=0.2, min_samples=3 and 3 clusters')

Text(0.5,1,'DBSCAN with eps=0.2, min_samples=3 and 3 clusters')

[Figure: heat map titled 'DBSCAN with eps=0.2, min_samples=3 and 3 clusters']

%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,9
g = sns.countplot(x=df_db['cluster'])  # pass the column as x explicitly; newer seaborn treats a positional argument as `data`
df_db['cluster'].value_counts()

 0    125
-1     47
 1      6
Name: cluster, dtype: int64

[Figure: count plot of courses per DBSCAN cluster (eps=0.2, min_samples=3)]

estimator = DBSCAN(eps=0.15, min_samples=5)
y_predict = estimator.fit_predict(df)
df_db = df.copy()
df_db['cluster'] = y_predict
df_db = df_db.sort_values(by=['cluster'])
# Note: df_db still contains the 'cluster' column, so the labels also enter this correlation
df_db_corr = df_db.T.corr()
df_db_dism = 1 - df_db_corr   # distance matrix
sns.set(font="monospace")
ax = sns.heatmap(df_db_dism)
ax.set_title('DBSCAN with eps=0.15, min_samples=5 and 4 clusters')

Text(0.5,1,'DBSCAN with eps=0.15, min_samples=5 and 4 clusters')

[Figure: heat map titled 'DBSCAN with eps=0.15, min_samples=5 and 4 clusters']

%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,9
g = sns.countplot(x=df_db['cluster'])  # pass the column as x explicitly; newer seaborn treats a positional argument as `data`
df_db['cluster'].value_counts()
g.set_title('DBSCAN with eps=0.15, min_samples=5 and 4 clusters, Courses per cluster distribution')

Text(0.5,1,'DBSCAN with eps=0.15, min_samples=5 and 4 clusters, Courses per cluster distribution')

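[Figure: count plot titled 'DBSCAN with eps=0.15, min_samples=5 and 4 clusters, Courses per cluster distribution']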

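However, the silhouette score for this setting is negative (-0.159177). Values near 0 indicate overlapping clusters, and negative values generally indicate that samples have been assigned to the wrong cluster, as a different cluster is more similar. This is despite the distance similarity heat map showing well-separated areas. Part of the explanation may be that the score above treats the DBSCAN noise points (label -1) as a cluster of their own.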
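
For a single sample, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The sketch below works this out by hand for one deliberately mislabeled point in a one-dimensional toy data set (an assumption) and compares it with `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data (assumption): the last point (x = 2) sits next to cluster 0
# but is deliberately labeled as cluster 1.
X = np.array([[0.0], [1.0], [10.0], [11.0], [2.0]])
labels = np.array([0, 0, 1, 1, 1])

s = silhouette_samples(X, labels)

# By hand for the last sample:
a = np.mean([abs(2.0 - 10.0), abs(2.0 - 11.0)])  # mean distance to its own cluster
b = np.mean([abs(2.0 - 0.0), abs(2.0 - 1.0)])    # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

print('silhouette_samples: %f, by hand: %f' % (s[-1], s_manual))
```

Since a is larger than b for this point, its silhouette value is negative: the point is closer to the cluster it was not assigned to.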

Conclusion:

Based on the results we can conclude that:
- DBSCAN does not produce a large number of clusters.
- The reason may be that there are a lot of outliers among the courses.
- K-Means with 7 clusters and HAC with a 1.4 cut-off give similar results.
In the next notebook, we will look deeper into the structure of the clusters and their characteristics using PCA and a classification approach.

Written on March 3, 2014