Hierarchical Clustering¶
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The Data¶
In [3]:
df = pd.read_csv('../DATA/cluster_mpg.csv')
In [4]:
df = df.dropna()
In [5]:
df.head()
Out[5]:
In [6]:
df.describe()
Out[6]:
In [7]:
df['origin'].value_counts()
Out[7]:
In [8]:
df_w_dummies = pd.get_dummies(df.drop('name',axis=1))
In [9]:
df_w_dummies
Out[9]:
In [10]:
from sklearn.preprocessing import MinMaxScaler
In [11]:
scaler = MinMaxScaler()
In [12]:
scaled_data = scaler.fit_transform(df_w_dummies)
In [13]:
scaled_data
Out[13]:
In [14]:
scaled_df = pd.DataFrame(scaled_data,columns=df_w_dummies.columns)
In [15]:
plt.figure(figsize=(15,8))
sns.heatmap(scaled_df,cmap='magma');
In [16]:
sns.clustermap(scaled_df,row_cluster=False)
Out[16]:
In [17]:
sns.clustermap(scaled_df,col_cluster=False)
Out[17]:
Using Scikit-Learn¶
In [18]:
from sklearn.cluster import AgglomerativeClustering
In [19]:
model = AgglomerativeClustering(n_clusters=4)
In [20]:
cluster_labels = model.fit_predict(scaled_df)
In [21]:
cluster_labels
Out[21]:
In [22]:
plt.figure(figsize=(12,4),dpi=200)
sns.scatterplot(data=df,x='mpg',y='weight',hue=cluster_labels)
Out[22]:
Exploring Number of Clusters with Dendrograms¶
Make sure to read the documentation online! https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
Assuming every point starts as its own cluster¶
In [23]:
model = AgglomerativeClustering(n_clusters=None,distance_threshold=0)
In [24]:
cluster_labels = model.fit_predict(scaled_df)
In [25]:
cluster_labels
Out[25]:
In [26]:
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
Linkage Model¶
In [27]:
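# Ward linkage on the scaled data (the default linkage in AgglomerativeClustering), not on model.children_
linkage_matrix = hierarchy.linkage(scaled_df, method='ward')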
In [28]:
linkage_matrix
Out[28]:
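A linkage matrix can also be assembled directly from a fitted AgglomerativeClustering model. The sketch below follows the approach shown in the scikit-learn documentation's dendrogram example: it combines `children_`, `distances_`, and a per-merge observation count. The helper name `linkage_from_model` and the toy array `X` are hypothetical and used only for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering


def linkage_from_model(model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering model.

    Assumes the model was fit with distance_threshold set (or compute_distances=True),
    so that model.distances_ is available.
    """
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current = 0
        for child in merge:
            if child < n_samples:
                current += 1  # leaf node: one original observation
            else:
                current += counts[child - n_samples]  # previously merged cluster
        counts[i] = current
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)


# Toy data (hypothetical): two small groups of points
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])

toy_model = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)
Z = linkage_from_model(toy_model)
print(Z)
```

The resulting `Z` can be passed to `hierarchy.dendrogram` in the same way as a matrix returned by `hierarchy.linkage`.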
In [29]:
plt.figure(figsize=(20,10))
# Warning: this plot can take a while to render.
dn = hierarchy.dendrogram(linkage_matrix)
In [30]:
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=48)
Choosing a Threshold Distance¶
What is the distance between two points?
In [31]:
scaled_df.describe()
Out[31]:
In [32]:
scaled_df['mpg'].idxmax()
Out[32]:
In [33]:
scaled_df['mpg'].idxmin()
Out[33]:
In [34]:
# https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
a = scaled_df.iloc[320]
b = scaled_df.iloc[28]
dist = np.linalg.norm(a-b)
In [35]:
dist
Out[35]:
Max possible distance?¶
Recall Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance
In [36]:
np.sqrt(len(scaled_df.columns))
Out[36]:
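Why the square root of the number of columns? After min-max scaling every feature lies in [0, 1], so no two rows can be farther apart than the all-zeros and all-ones corners of the unit hypercube. A minimal sketch of that upper bound, where `n_features` is a hypothetical column count chosen for illustration:

```python
import numpy as np

n_features = 10  # hypothetical number of scaled columns

corner_min = np.zeros(n_features)  # every feature at its minimum
corner_max = np.ones(n_features)   # every feature at its maximum

# Euclidean distance between opposite corners of the unit hypercube
max_dist = np.linalg.norm(corner_max - corner_min)
print(max_dist, np.sqrt(n_features))
```

This is only an upper bound. Real rows rarely sit at opposite corners, so observed distances are usually smaller.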
Creating a Model Based on Distance Threshold¶
- distance_threshold
- The linkage distance threshold at or above which clusters will not be merged.
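To see how the threshold controls the number of clusters, here is a small sketch on toy data. The array `X` and the threshold values are hypothetical and chosen only so that the effect is easy to see; it assumes the default Ward linkage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data (hypothetical): two tight groups that are far apart
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])

n_clusters_by_threshold = {}
for threshold in (0, 1, 100):
    toy_model = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold)
    toy_model.fit(X)
    # n_clusters_ is the number of clusters found for this threshold
    n_clusters_by_threshold[threshold] = toy_model.n_clusters_

print(n_clusters_by_threshold)
```

A threshold of 0 leaves every point in its own cluster, while a very large threshold lets everything merge into one.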
In [252]:
model = AgglomerativeClustering(n_clusters=None,distance_threshold=2)
In [253]:
cluster_labels = model.fit_predict(scaled_data)
In [254]:
cluster_labels
Out[254]:
In [255]:
np.unique(cluster_labels)
Out[255]:
Linkage Matrix¶
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
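The description above is easier to follow on a tiny example. The sketch below runs SciPy's `hierarchy.linkage` on four hypothetical points and prints each row of Z in the terms used above (the two merged cluster indices, the distance, and the size of the new cluster).

```python
import numpy as np
from scipy.cluster import hierarchy

# Four toy points (hypothetical): two near x=0 and two near x=5
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5]])
n = len(points)

Z = hierarchy.linkage(points, method='ward')

for i, (left, right, dist, count) in enumerate(Z):
    print(f"step {i}: merge {int(left)} and {int(right)} "
          f"-> cluster {n + i}, distance {dist:.3f}, size {int(count)}")
```

Indices below `n` refer to the original points. Indices of `n` or above refer to clusters created in earlier rows.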
In [256]:
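# Ward linkage on the scaled data (the default linkage in AgglomerativeClustering), not on model.children_
linkage_matrix = hierarchy.linkage(scaled_data, method='ward')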
In [257]:
linkage_matrix
Out[257]:
In [258]:
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=11)