Hierarchical Clustering¶
In [217]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The Data¶
In [218]:
df = pd.read_csv('../DATA/cluster_mpg.csv')
In [219]:
df = df.dropna()
In [220]:
df.head()
Out[220]:
In [221]:
df.describe()
Out[221]:
In [222]:
df['origin'].value_counts()
Out[222]:
In [223]:
df_w_dummies = pd.get_dummies(df.drop('name',axis=1))
In [224]:
df_w_dummies
Out[224]:
In [225]:
from sklearn.preprocessing import MinMaxScaler
In [226]:
scaler = MinMaxScaler()
In [227]:
scaled_data = scaler.fit_transform(df_w_dummies)
In [228]:
scaled_data
Out[228]:
In [229]:
scaled_df = pd.DataFrame(scaled_data,columns=df_w_dummies.columns)
In [230]:
plt.figure(figsize=(15,8))
sns.heatmap(scaled_df,cmap='magma');
In [231]:
sns.clustermap(scaled_df,row_cluster=False)
Out[231]:
In [232]:
sns.clustermap(scaled_df,col_cluster=False)
Out[232]:
Using Scikit-Learn¶
In [233]:
from sklearn.cluster import AgglomerativeClustering
In [234]:
model = AgglomerativeClustering(n_clusters=4)
In [235]:
cluster_labels = model.fit_predict(scaled_df)
In [236]:
cluster_labels
Out[236]:
In [237]:
plt.figure(figsize=(12,4),dpi=200)
sns.scatterplot(data=df,x='mpg',y='weight',hue=cluster_labels)
Out[237]:
Exploring Number of Clusters with Dendrograms¶
Make sure to read the documentation online! https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
Assuming every point starts as its own cluster¶
In [238]:
model = AgglomerativeClustering(n_clusters=None,distance_threshold=0)
In [239]:
cluster_labels = model.fit_predict(scaled_df)
In [240]:
cluster_labels
Out[240]:
In [241]:
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
Linkage Model¶
In [242]:
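# Link the scaled data itself (Ward is the AgglomerativeClustering default); model.children_ holds merge indices, not observations
linkage_matrix = hierarchy.linkage(scaled_data,method='ward')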
In [243]:
linkage_matrix
Out[243]:
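The children_ attribute on its own only records which two nodes merge at each step. Below is a minimal sketch, adapted from the dendrogram example in the scikit-learn documentation, of assembling a SciPy-style linkage matrix from a fitted model's children_ and distances_. The helper name linkage_from_model and the random stand-in data are assumptions for illustration; they are not part of this notebook.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def linkage_from_model(model):
    """Build an (n-1) x 4 linkage matrix from a fitted AgglomerativeClustering.

    Hypothetical helper. It requires model.distances_, which is only
    available when the model was fit with distance_threshold set
    (or compute_distances=True).
    """
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current = 0
        for child in merge:
            if child < n_samples:
                current += 1  # leaf node: one original observation
            else:
                current += counts[child - n_samples]  # earlier merged cluster
        counts[i] = current
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)


# Random stand-in data (assumption); in the notebook this would be scaled_df
rng = np.random.default_rng(0)
X = rng.random((30, 3))

model = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)
Z = linkage_from_model(model)
print(Z.shape)
```

The resulting matrix can be passed to hierarchy.dendrogram in the same way as the output of hierarchy.linkage.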
In [244]:
plt.figure(figsize=(20,10))
# Warning: this plot draws every leaf, so it can take a while to render
dn = hierarchy.dendrogram(linkage_matrix)
In [245]:
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=48)
Choosing a Threshold Distance¶
What is the distance between two points?
In [246]:
scaled_df.describe()
Out[246]:
In [247]:
scaled_df['mpg'].idxmax()
Out[247]:
In [248]:
scaled_df['mpg'].idxmin()
Out[248]:
In [249]:
# https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
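# Row positions taken from the idxmax()/idxmin() results above
a = scaled_df.iloc[320]
b = scaled_df.iloc[28]
# Euclidean distance between the two scaled rows
dist = np.linalg.norm(a-b)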
In [250]:
dist
Out[250]:
Max possible distance?¶
Recall Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance
In [251]:
len(scaled_df.columns)
Out[251]:
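After min-max scaling every feature lies in [0, 1], so no two rows can be farther apart than the diagonal of the unit hypercube, which is sqrt(d) for d columns. A small sketch of that upper bound follows. The column count of 10 is an assumption for illustration; use len(scaled_df.columns) in the notebook.

```python
import numpy as np


def max_minmax_distance(n_features):
    """Upper bound on the Euclidean distance between two min-max scaled rows."""
    corner_a = np.zeros(n_features)  # every feature at its minimum
    corner_b = np.ones(n_features)   # every feature at its maximum
    return np.linalg.norm(corner_a - corner_b)


n_features = 10  # assumed column count, for illustration only
max_dist = max_minmax_distance(n_features)
print(max_dist, np.sqrt(n_features))
```

With one-hot encoded columns the bound is not reachable in practice, because two rows can differ in at most two of the dummy columns for a single category. It is still a useful scale when picking a distance threshold.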
Creating a Model Based on Distance Threshold¶
- distance_threshold
- The linkage distance threshold above which clusters will not be merged. When it is set, n_clusters must be None.
In [252]:
model = AgglomerativeClustering(n_clusters=None,distance_threshold=2)
In [253]:
cluster_labels = model.fit_predict(scaled_data)
In [254]:
cluster_labels
Out[254]:
In [255]:
np.unique(cluster_labels)
Out[255]:
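To get a feel for how the threshold controls the number of clusters, one option is to fit the model at a few thresholds and count the unique labels. This is a sketch on random stand-in data; the thresholds, the data, and the helper name n_clusters_for are assumptions, and the actual counts for the car data will differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Random stand-in for the scaled data (assumption)
rng = np.random.default_rng(0)
X = rng.random((60, 4))


def n_clusters_for(threshold):
    """Hypothetical helper: cluster count when merging stops at `threshold`."""
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold)
    labels = model.fit_predict(X)
    return len(np.unique(labels))


counts = {t: n_clusters_for(t) for t in (0.5, 1.0, 2.0)}
print(counts)
```

A larger threshold allows more merges, so the cluster count can only stay the same or shrink as the threshold grows.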
Linkage Matrix¶
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
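The format is easier to read on a tiny example. The sketch below links four hand-picked 2-D points with single linkage. Both the points and the method are assumptions chosen so the distances are easy to check by hand, whereas the notebook's model uses the Ward default.

```python
import numpy as np
from scipy.cluster import hierarchy

# Four hand-picked points (assumption): two tight pairs, far apart
points = np.array([
    [0.0, 0.0],  # observation 0
    [0.0, 1.0],  # observation 1
    [5.0, 0.0],  # observation 2
    [5.0, 2.0],  # observation 3
])

Z = hierarchy.linkage(points, method='single')

n = len(points)
for i, (left, right, dist, size) in enumerate(Z):
    # Row i merges clusters `left` and `right` into the new cluster n + i
    print(f"step {i}: {int(left)} + {int(right)} -> cluster {n + i}, "
          f"distance {dist:.2f}, size {int(size)}")
```

Indices 0 to 3 refer to the original points; indices 4 and above refer to clusters formed in earlier rows, so the final row joins the two pairs into a single cluster of size 4.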
In [256]:
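# Link the scaled data itself (Ward is the AgglomerativeClustering default); model.children_ holds merge indices, not observations
linkage_matrix = hierarchy.linkage(scaled_data,method='ward')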
In [257]:
linkage_matrix
Out[257]:
In [258]:
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=11)