DBSCAN Hyperparameters¶
Let's explore the hyperparameters for DBSCAN and how they can change results!
DBSCAN and Clustering Examples¶
In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [34]:
two_blobs = pd.read_csv('../DATA/cluster_two_blobs.csv')
two_blobs_outliers = pd.read_csv('../DATA/cluster_two_blobs_outliers.csv')
In [36]:
sns.scatterplot(data=two_blobs,x='X1',y='X2')
Out[36]:
In [37]:
sns.scatterplot(data=two_blobs_outliers,x='X1',y='X2')
Out[37]:
Label Discovery¶
In [75]:
def display_categories(model,data):
    labels = model.fit_predict(data)
    sns.scatterplot(data=data,x='X1',y='X2',hue=labels,palette='Set1')
DBSCAN¶
In [76]:
from sklearn.cluster import DBSCAN
In [77]:
help(DBSCAN)
In [78]:
dbscan = DBSCAN()
In [79]:
display_categories(dbscan,two_blobs)
In [80]:
display_categories(dbscan,two_blobs_outliers)
Epsilon¶
| eps : float, default=0.5
| The maximum distance between two samples for one to be considered
| as in the neighborhood of the other. This is not a maximum bound
| on the distances of points within a cluster. This is the most
| important DBSCAN parameter to choose appropriately for your data set
| and distance function.
In [81]:
# Tiny Epsilon --> Tiny Max Distance --> Everything is an outlier (class=-1)
dbscan = DBSCAN(eps=0.001)
display_categories(dbscan,two_blobs_outliers)
In [82]:
# Huge Epsilon --> Huge Max Distance --> Everything is in the same cluster (class=0)
dbscan = DBSCAN(eps=10)
display_categories(dbscan,two_blobs_outliers)
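The two extremes above can be checked numerically as well as visually. The following is a minimal sketch, assuming a synthetic `make_blobs` dataset as a stand-in for the CSV files (the data, the variable names and the exact `eps` values are illustrative assumptions, not taken from the notebook):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the notebook's CSV data (an assumption)
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=42)

# Tiny epsilon: no point has enough neighbors, so every point is labeled -1
tiny_labels = DBSCAN(eps=1e-6).fit_predict(X)

# Huge epsilon: every point is in every other point's neighborhood, so one cluster
huge_labels = DBSCAN(eps=1e3).fit_predict(X)

n_tiny_outliers = int(np.sum(tiny_labels == -1))
n_huge_clusters = len(set(huge_labels) - {-1})
print(n_tiny_outliers, n_huge_clusters)
```

Useful epsilon values lie somewhere between these two degenerate cases.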
In [166]:
# How to find a good epsilon?
dbscan = DBSCAN(eps=1)
display_categories(dbscan,two_blobs_outliers)
In [51]:
dbscan.labels_
Out[51]:
In [52]:
dbscan.labels_ == -1
Out[52]:
In [54]:
np.sum(dbscan.labels_ == -1)
Out[54]:
In [57]:
100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
Out[57]:
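The count and percentage computed in the cells above could be wrapped in a small helper so they can be reused for any fitted model. This is only a sketch; the function name `outlier_stats` is hypothetical and not part of scikit-learn:

```python
import numpy as np

def outlier_stats(labels):
    """Return (count, percentage) of points labeled -1, DBSCAN's noise label."""
    labels = np.asarray(labels)
    is_outlier = labels == -1
    count = int(np.sum(is_outlier))
    # The mean of a boolean mask is the fraction of True values
    percent = 100 * float(np.mean(is_outlier))
    return count, percent

# Hand-made labels for illustration: one noise point out of four
count, percent = outlier_stats([-1, 0, 0, 1])
```

With a fitted model, the call would be `outlier_stats(dbscan.labels_)`.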
Charting reasonable Epsilon values¶
In [159]:
# Look for the "knee" (elbow) of the curve. See the Kneedle paper: https://raghavan.usc.edu/papers/kneedle-simplex11.pdf
In [170]:
# np.arange(start=0.01,stop=10,step=0.01)
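A different, commonly used way to look for a knee is the sorted k-distance graph: for each point, compute the distance to its k-th nearest neighbor, sort those distances, and look for the bend. This is not what the notebook does below (it sweeps epsilon and counts outliers). The sketch assumes synthetic data and `k = 5`, chosen to match DBSCAN's default `min_samples`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the notebook's CSV data (an assumption)
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=42)

k = 5  # same as DBSCAN's default min_samples, which counts the point itself

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the distance to the k-th
# point in the neighborhood, the point itself included.
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted k-distances: a sharp bend in this curve suggests a candidate eps
k_distances = np.sort(distances[:, -1])
```

Plotting `k_distances` against its index (for example with `plt.plot(k_distances)`) gives the curve to inspect for a knee.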
In [189]:
outlier_percent = []
number_of_outliers = []
for eps in np.linspace(0.001,10,100):
    # Create Model
    dbscan = DBSCAN(eps=eps)
    dbscan.fit(two_blobs_outliers)
    # Log Number of Outliers
    number_of_outliers.append(np.sum(dbscan.labels_ == -1))
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    outlier_percent.append(perc_outliers)
In [190]:
sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
Out[190]:
In [192]:
sns.lineplot(x=np.linspace(0.001,10,100),y=number_of_outliers)
plt.ylabel("Number of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.xlim(0,1)
Out[192]:
Do we want to think in terms of percentage targeting instead?¶
If so, you could "target" a percentage: for example, choose an epsilon from the range that labels 1%-5% of the points as outliers.
In [193]:
sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.ylim(0,5)
plt.xlim(0,2)
plt.hlines(y=1,xmin=0,xmax=2,colors='red',ls='--')
Out[193]:
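Instead of reading the target range off the chart, the candidate epsilon values can be selected with a boolean mask. The sketch below assumes synthetic data (two blobs plus a few uniformly scattered points) in place of `cluster_two_blobs_outliers.csv`; the names `in_target` and `candidate_eps` are illustrative, and the selection may be empty if no swept value lands in the 1%-5% band:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in: two blobs plus a few scattered points (an assumption)
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=42)
scattered = rng.uniform(-15, 15, size=(6, 2))
X = np.vstack([blobs, scattered])

# Same epsilon sweep as the notebook
eps_values = np.linspace(0.001, 10, 100)
outlier_percent = np.array([
    100 * np.mean(DBSCAN(eps=eps).fit(X).labels_ == -1) for eps in eps_values
])

# Keep only the epsilon values whose outlier percentage is between 1% and 5%
in_target = (outlier_percent >= 1) & (outlier_percent <= 5)
candidate_eps = eps_values[in_target]
print(candidate_eps)
```

Any value in `candidate_eps` meets the percentage target; choosing among them is still a judgment call.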
In [194]:
# Try an epsilon chosen from the percentage chart above
dbscan = DBSCAN(eps=0.4)
display_categories(dbscan,two_blobs_outliers)
Do we want to think in terms of number of outliers targeting instead?¶
If so, you could "target" a specific count, such as exactly 3 points labeled as outliers.
In [203]:
sns.lineplot(x=np.linspace(0.001,10,100),y=number_of_outliers)
plt.ylabel("Number of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.ylim(0,10)
plt.xlim(0,6)
plt.hlines(y=3,xmin=0,xmax=10,colors='red',ls='--')
Out[203]:
In [204]:
# Try an epsilon chosen from the outlier-count chart above
dbscan = DBSCAN(eps=0.75)
display_categories(dbscan,two_blobs_outliers)
Minimum Samples¶
| min_samples : int, default=5
| The number of samples (or total weight) in a neighborhood for a point
| to be considered as a core point. This includes the point itself.
How should you choose the minimum number of samples? One discussion:
https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan
In [218]:
outlier_percent = []
for n in np.arange(1,100):
    # Create Model
    dbscan = DBSCAN(min_samples=n)
    dbscan.fit(two_blobs_outliers)
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    outlier_percent.append(perc_outliers)
In [226]:
sns.lineplot(x=np.arange(1,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Minimum Number of Samples")
Out[226]:
In [229]:
# Rule of thumb: min_samples = 2 * number of dimensions (feature columns)
num_dim = two_blobs_outliers.shape[1]
dbscan = DBSCAN(min_samples=2*num_dim)
display_categories(dbscan,two_blobs_outliers)
In [230]:
num_dim = two_blobs_outliers.shape[1]
dbscan = DBSCAN(eps=0.75,min_samples=2*num_dim)
display_categories(dbscan,two_blobs_outliers)
In [231]:
dbscan = DBSCAN(min_samples=1)
display_categories(dbscan,two_blobs_outliers)
In [232]:
dbscan = DBSCAN(eps=0.75,min_samples=1)
display_categories(dbscan,two_blobs_outliers)