
DBSCAN Lab

In this notebook, we will use DBSCAN to cluster a couple of datasets. We will examine how changing its parameters (epsilon and min_samples) changes the resulting cluster structure.

In [1]:
import pandas as pd

# Load the first 80 rows of the blobs dataset as a NumPy array
dataset_1 = pd.read_csv('blobs.csv')[:80].values

This is our first dataset. It looks like this:

In [2]:
%matplotlib inline

import dbscan_lab_helper as helper
    
helper.plot_dataset(dataset_1)
[Plot: dataset_1, a scatter of three "blobs" of points]

Let's cluster it using DBSCAN's default settings and see what happens. We are hoping it will assign each of the three "blobs" to its own cluster. Can it do that out of the box?

In [3]:
#TODO: Import sklearn's cluster module
from sklearn import cluster

#TODO: create an instance of DBSCAN
dbscan = cluster.DBSCAN()
#TODO: use DBSCAN's fit_predict to return clustering labels for dataset_1
clustering_labels_1 = dbscan.fit_predict(dataset_1)
In [4]:
# Plot clustering
helper.plot_clustered_dataset(dataset_1, clustering_labels_1)
[Plot: dataset_1 colored by DBSCAN's default clustering]

Does that look okay? Was it able to group the dataset into the three clusters we were hoping for?

As you can see, we will have to make some tweaks. Let's start by looking at Epsilon, the radius of each point's neighborhood. The default value in sklearn is 0.5.

In [5]:
# Plot clustering with neighborhoods
helper.plot_clustered_dataset(dataset_1, clustering_labels_1, neighborhood=True)
[Plot: the default clustering with each point's Epsilon = 0.5 neighborhood drawn]

From the graph, we can see that an Epsilon value of 0.5 is too small for this dataset. We need to increase it so that the points in a blob overlap each other's neighborhoods, but not to the degree where a single cluster would span two blobs.
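If you want a numeric sanity check rather than an eyeballed one, a common heuristic is to look at each point's distance to its k-th nearest neighbor; if typical distances sit well above 0.5, the default Epsilon cannot connect points within a blob. Here is a minimal sketch using scikit-learn's NearestNeighbors (the choice of k=4 is an arbitrary assumption, not part of the lab):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Distance from each point to its 4th-nearest neighbor.
# n_neighbors=5 because the query set is the training set,
# so each point's nearest "neighbor" is the point itself.
nn = NearestNeighbors(n_neighbors=5).fit(dataset_1)
distances, _ = nn.kneighbors(dataset_1)
kth_distances = np.sort(distances[:, -1])

# If the median is well above 0.5, the default eps is too small
# to place neighboring points in the same neighborhood.
print("median 4th-NN distance:", np.median(kth_distances))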

Quiz: Change the value of Epsilon so that each blob is its own cluster (without any noise points). The graph shows the points in the dataset as well as the neighborhood of each point:

In [6]:
# TODO: increase the value of epsilon to allow DBSCAN to find three clusters in the dataset
epsilon=2

# Cluster
dbscan = cluster.DBSCAN(eps=epsilon)
clustering_labels_2 = dbscan.fit_predict(dataset_1)

# Plot
helper.plot_clustered_dataset(dataset_1, clustering_labels_2, neighborhood=True, epsilon=epsilon)
[Plot: the clustering at the increased Epsilon, with each blob forming its own cluster]

Were you able to do it? As you change the values, you can see the points group into larger clusters while the number of noise points keeps decreasing. At Epsilon values above 1.6 we get the clustering we're after; but once we increase it above 5, two blobs start joining into one cluster. So in this scenario the right Epsilon lies somewhere between those two values.
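To see the same thing numerically, here is a quick sweep (a sketch reusing dataset_1 and the cluster module from above; the sweep values bracket the 1.6 and 5 boundaries from the discussion, and exact counts depend on the data):

import numpy as np

# Count clusters and noise points across a range of Epsilon values.
# DBSCAN labels noise points as -1.
for eps in [0.5, 1.0, 1.6, 2.0, 5.0, 6.0]:
    labels = cluster.DBSCAN(eps=eps).fit_predict(dataset_1)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")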

Dataset 2

Let's now look at a dataset that's a little trickier.

In [7]:
# Load the first 300 rows of the second dataset as a NumPy array
dataset_2 = pd.read_csv('varied.csv')[:300].values
In [8]:
# Plot
helper.plot_dataset(dataset_2, xlim=(-14, 5), ylim=(-12, 7))
[Plot: dataset_2, two dense blobs with a sparser central region]

What happens if we run DBSCAN with the default parameter values?

In [9]:
# Cluster with DBSCAN
# TODO: Create a new instance of DBSCAN
dbscan = cluster.DBSCAN()
# TODO: use DBSCAN's fit_predict to return clustering labels for dataset_2
clustering_labels_3 = dbscan.fit_predict(dataset_2)
In [10]:
# Plot
helper.plot_clustered_dataset(dataset_2, 
                              clustering_labels_3, 
                              xlim=(-14, 5), 
                              ylim=(-12, 7), 
                              neighborhood=True, 
                              epsilon=0.5)
[Plot: dataset_2 under DBSCAN's default parameters, with Epsilon = 0.5 neighborhoods drawn]

This clustering could make sense in some scenarios, but it seems rather arbitrary. Looking at the dataset, we can imagine at least two scenarios for what we'd want to do:

  • Scenario 1: Break the dataset up into three clusters: the blob on the left, the blob on the right, and the central area (even though it's less dense than the blobs on either side).
  • Scenario 2: Break the dataset up into two clusters: the blob on the left and the blob on the right, marking all the points in the center as noise.

What values for the DBSCAN parameters would allow us to satisfy each of those scenarios? Try a number of parameter values to see if you can find a clustering that makes more sense.

In [11]:
# TODO: Experiment with different values for eps and min_samples to find a suitable clustering for the dataset
eps=1.32
min_samples=50

# Cluster with DBSCAN
dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
clustering_labels_4 = dbscan.fit_predict(dataset_2)

# Plot
helper.plot_clustered_dataset(dataset_2, 
                              clustering_labels_4, 
                              xlim=(-14, 5), 
                              ylim=(-12, 7), 
                              neighborhood=True, 
                              epsilon=0.5)
[Plot: dataset_2 clustered with eps=1.32 and min_samples=50]

The following grid plots the DBSCAN clustering results for a range of parameter values. Epsilon varies across the columns, while each row uses a different value of min_samples.

In [12]:
eps_values = [0.3, 0.5, 1, 1.3, 1.5]
min_samples_values = [2, 5, 10, 20, 80]

helper.plot_dbscan_grid(dataset_2, eps_values, min_samples_values)
[Plot: grid of DBSCAN clusterings across the eps and min_samples values above]

Heuristics for experimenting with DBSCAN's parameters

Looking at this grid, we can guess at some general heuristics for tweaking the parameters of DBSCAN:

| | Epsilon too low | Epsilon too high |
|---|---|---|
| min_samples too low | Many small clusters, more than anticipated for the dataset. Action: increase min_samples and epsilon. | Most points belong to one cluster. Action: decrease epsilon and increase min_samples. |
| min_samples too high | Most/all data points are labeled as noise. Action: increase epsilon and decrease min_samples. | Except in extremely dense regions, most or all data points are labeled as noise. Action: decrease min_samples and epsilon. |
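These heuristics can be turned into a quick diagnostic. Here is a sketch (the helper below is hypothetical, not part of the lab's code): summarize a labeling by its cluster count and noise fraction, then use the table above to decide which parameter to move.

import numpy as np

def summarize_dbscan(labels):
    """Return (number of clusters, fraction of points labeled noise)."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction

# Per the table: many tiny clusters -> raise min_samples and epsilon;
# one giant cluster -> lower epsilon and raise min_samples;
# mostly noise -> raise epsilon and lower min_samples.
n_clusters, noise_fraction = summarize_dbscan(clustering_labels_4)
print(f"{n_clusters} clusters, {noise_fraction:.0%} noise")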

Quiz

  • Which values do you believe best satisfy scenario 1?
  • Which values do you believe best satisfy scenario 2?

Answers:

  • Epsilon=1.3, min_samples=5 seems to do a good job here. There are other similar settings that work as well, (1, 2), for example.
  • Epsilon=1.3, min_samples=20 does the best job of satisfying scenario 2.
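As a final check, you can rerun DBSCAN with both answer settings and compare (a sketch; the expected outcome of three clusters for scenario 1 and two for scenario 2 is what the grid suggests, not guaranteed):

# Scenario 1: eps=1.3, min_samples=5; scenario 2: eps=1.3, min_samples=20.
for eps, min_samples in [(1.3, 5), (1.3, 20)]:
    labels = cluster.DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataset_2)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters")
    helper.plot_clustered_dataset(dataset_2, labels, xlim=(-14, 5), ylim=(-12, 7))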
