ML/DL Notes
  • Home
  • Advacned Notebooks
    • Optional Lab: Model Evaluation and Selection
    • Optional Lab: Diagnosing Bias and Variance
    • Practice Lab: Neural Networks for Handwritten Digit Recognition, Binary
    • Optional Lab - Neurons and Layers
    • Optional Lab - Simple Neural Network
    • Optional Lab - Simple Neural Network
    • Practice Lab: Neural Networks for Handwritten Digit Recognition, Multiclass
    • Optional Lab: Back propagation using a computation graph
    • Optional Lab - Derivatives
    • Optional Lab - Multi-class Classification
    • Optional Lab - ReLU activation
    • Optional Lab - Softmax Function
    • Practice Lab: Advice for Applying Machine Learning
    • Practice Lab: Decision Trees
    • Ungraded Lab: Decision Trees
    • Ungraded Lab - Trees Ensemble
  • Advanced Learning Algorithms
    • Advanced
  • Reinforcement learning notebooks
    • Deep Q-Learning - Lunar Lander
    • State Action Value Function Example
    • Utils
  • Supervised Legacy
    • Classic_Supervised
  • Supervised Notebooks
    • Optional Lab: Cost Function
    • Optional Lab: Gradient Descent for Linear Regression
    • Optional Lab: Python, NumPy and Vectorization
    • Optional Lab: Multiple Variable Linear Regression
    • Optional Lab: Feature scaling and Learning Rate (Multi-variable)
    • Optional Lab: Feature Engineering and Polynomial Regression
    • Optional Lab: Linear Regression using Scikit-Learn
    • Practice Lab: Linear Regression
    • Optional Lab: Classification
    • Optional Lab: Logistic Regression
    • Optional Lab: Logistic Regression, Decision Boundary
    • Optional Lab: Logistic Regression, Logistic Loss
    • Optional Lab: Cost Function for Logistic Regression
    • Optional Lab: Gradient Descent for Logistic Regression
    • Ungraded Lab: Logistic Regression using Scikit-Learn
    • Ungraded Lab: Overfitting
    • Optional Lab - Regularized Cost and Gradient
    • Logistic Regression
  • Udacity
    • Changing K Solution
    • DBSCAN Lab
    • Feature Scaling Solution
    • 2. KMeans vs GMM on The Iris Dataset
    • Selecting the optimal number of clusters using Silhouette Scoring on KMeans and GMM clustering - SOLUTION
    • Implementing the Gradient Descent Algorithm
    • Hierarchical Clustering Lab
    • Independent Component Analysis Lab
    • Interpret PCA Results Solution
    • Mini Batch Gradient Descent
    • Multiple Linear Regression
    • PCA 1 Solution
    • PCA Mini Project Solution
    • Random Projection Solution
    • Predicting Student Admissions with Neural Networks
    • Perceptron algorithm
  • Unsupervised,Recommenders,Reinforcement Learning
    • Unsupervised
  • Unsupervised notebooks
    • K-means Clustering
    • Practice lab: Collaborative Filtering Recommender Systems
    • PCA - An example on Exploratory Data Analysis
    • Practice lab: Deep Learning for Content-Based Filtering
  • Previous
  • Next
  • 1. KMeans vs GMM on a Generated Dataset
  • 2. KMeans vs GMM on The Iris Dataset

1. KMeans vs GMM on a Generated Dataset¶

In the first example we'll look at, we'll generate a Gaussian dataset and attempt to cluster it and see if the clustering matches the original labels of the generated dataset.

We can use sklearn's make_blobs function to create a dataset of Gaussian blobs:

In [9]:
Copied!
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets, mixture

%matplotlib inline

n_samples = 1000

varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[5, 1, 0.5],
                             random_state=3)
X, y = varied[0], varied[1]

plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=y, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()
import numpy as np import matplotlib.pyplot as plt from sklearn import cluster, datasets, mixture %matplotlib inline n_samples = 1000 varied = datasets.make_blobs(n_samples=n_samples, cluster_std=[5, 1, 0.5], random_state=3) X, y = varied[0], varied[1] plt.figure( figsize=(16,12)) plt.scatter(X[:,0], X[:,1], c=y, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis')) plt.show()
No description has been provided for this image

Now when we hand off this dataset to the clustering algorithms, we obviously will not pass in the labels. So let's start with KMeans and see how it does with the dataset. WIll it be to produce clusters that match the original labels?

In [10]:
Copied!
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
pred = kmeans.fit_predict(X)
from sklearn.cluster import KMeans kmeans = KMeans(n_clusters=3) pred = kmeans.fit_predict(X)
In [11]:
Copied!
plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=pred, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()
plt.figure( figsize=(16,12)) plt.scatter(X[:,0], X[:,1], c=pred, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis')) plt.show()
No description has been provided for this image

How good of a job did KMeans do? Was it able to find clusters that match or are similar to the original labels?

Let us now try clustering with GaussianMixture:

In [12]:
Copied!
# TODO: Import GaussianMixture
from sklearn.mixture import GaussianMixture

# TODO: Create an instance of Gaussian Mixture with 3 components
gmm = GaussianMixture(n_components=3).fit(X)

# TODO: fit the dataset
gmm = gmm.fit(X)

# TODO: predict the clustering labels for the dataset
pred_gmm = gmm.predict(X)
# TODO: Import GaussianMixture from sklearn.mixture import GaussianMixture # TODO: Create an instance of Gaussian Mixture with 3 components gmm = GaussianMixture(n_components=3).fit(X) # TODO: fit the dataset gmm = gmm.fit(X) # TODO: predict the clustering labels for the dataset pred_gmm = gmm.predict(X)
In [13]:
Copied!
# Plot the clusters
plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=pred_gmm, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()
# Plot the clusters plt.figure( figsize=(16,12)) plt.scatter(X[:,0], X[:,1], c=pred_gmm, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis')) plt.show()
No description has been provided for this image

By visually comparing the result of KMeans and GMM clustering, which one was better able to match the original?

2. KMeans vs GMM on The Iris Dataset¶

For our second example, we'll take a dataset that has more than two features. The Iris dataset is great for this purpose since it is reasonable to assume it's distributed according to Gaussian distributions.

The Iris dataset is a labeled dataset with four features:

In [14]:
Copied!
import seaborn as sns

iris = sns.load_dataset("iris")

iris.head()
import seaborn as sns iris = sns.load_dataset("iris") iris.head()
Out[14]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

How do you visualize a datset with four dimensions?

There are a few ways (e.g. PairGrid, t-SNE, or project into a lower number number dimensions using PCA). Let's attempt to visualize using PairGrid because it does not distort the dataset -- it merely plots every pair of features against each other in a subplot:

In [15]:
Copied!
g = sns.PairGrid(iris, hue="species", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width'])
g.map(plt.scatter)
plt.show()
g = sns.PairGrid(iris, hue="species", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width']) g.map(plt.scatter) plt.show()
No description has been provided for this image

If we cluster the Iris datset using KMeans, how close would the resulting clusters match the original labels?

In [16]:
Copied!
kmeans_iris = KMeans(n_clusters=3)
pred_kmeans_iris = kmeans_iris.fit_predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])
kmeans_iris = KMeans(n_clusters=3) pred_kmeans_iris = kmeans_iris.fit_predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])
In [17]:
Copied!
iris['kmeans_pred'] = pred_kmeans_iris

g = sns.PairGrid(iris, hue="kmeans_pred", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width'])
g.map(plt.scatter)
plt.show()
iris['kmeans_pred'] = pred_kmeans_iris g = sns.PairGrid(iris, hue="kmeans_pred", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width']) g.map(plt.scatter) plt.show()
No description has been provided for this image

How do these clusters match the original labels?

You can clearly see that visual inspection is no longer useful if we're working with multiple dimensions like this. So how can we evaluate the clustering result versus the original labels?

You guessed it. We can use an external cluster validation index such as the adjusted Rand score which generates a score between -1 and 1 (where an exact match will be scored as 1).

In [18]:
Copied!
# TODO: Import adjusted rand score
from sklearn.metrics import adjusted_rand_score

# TODO: calculate adjusted rand score passing in the original labels and the kmeans predicted labels 
iris_kmeans_score = adjusted_rand_score(iris['species'], iris['kmeans_pred'])

# Print the score
iris_kmeans_score
# TODO: Import adjusted rand score from sklearn.metrics import adjusted_rand_score # TODO: calculate adjusted rand score passing in the original labels and the kmeans predicted labels iris_kmeans_score = adjusted_rand_score(iris['species'], iris['kmeans_pred']) # Print the score iris_kmeans_score
Out[18]:
0.73023827228346971

What if we cluster using Gaussian Mixture models? Would it earn a better ARI score?

In [19]:
Copied!
gmm_iris = GaussianMixture(n_components=3).fit(iris[['sepal_length','sepal_width','petal_length','petal_width']])
pred_gmm_iris = gmm_iris.predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])
gmm_iris = GaussianMixture(n_components=3).fit(iris[['sepal_length','sepal_width','petal_length','petal_width']]) pred_gmm_iris = gmm_iris.predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])
In [20]:
Copied!
iris['gmm_pred'] = pred_gmm_iris

# TODO: calculate adjusted rand score passing in the original 
# labels and the GMM predicted labels iris['species']
iris_gmm_score = adjusted_rand_score(iris['species'], iris['gmm_pred'])

# Print the score
iris_gmm_score
iris['gmm_pred'] = pred_gmm_iris # TODO: calculate adjusted rand score passing in the original # labels and the GMM predicted labels iris['species'] iris_gmm_score = adjusted_rand_score(iris['species'], iris['gmm_pred']) # Print the score iris_gmm_score
Out[20]:
0.90387423177481241

Thanks to ARI socres, we have a clear indicator which clustering result better matches the original dataset.


Documentation built with MkDocs.

Keyboard Shortcuts

Keys Action
? Open this help
n Next page
p Previous page
s Search