Selecting the optimal number of clusters using Silhouette Scoring on KMeans and GMM clustering - SOLUTION

In this lesson, you learned to use more advanced clustering algorithms, such as Gaussian Mixture Modeling (GMM), and you visualized the resulting clusters. You also learned how internal indices, such as the Silhouette Coefficient, score models fit to unlabeled data.

In the following exercise you will put this learning into practice:

You will import the required libraries
You will create a Gaussian distribution sample dataset
You will create a list of cluster counts (k) to test
You will fit the data to models using the K-Means and GMM algorithms
You will score the models to find which model and which number of clusters (k) produced the best results.

At the end of all that you will also be able to visualize your clusters and Silhouette Coefficient scores. For more details on scikit-learn's silhouette_score() and its parameters, CLICK HERE.

Please note: for the following code and visualizations to run as intended, use the variable names exactly as written in the exercise below.

1. Import Libraries

In [1]:

# import libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_samples, silhouette_score
import scipy.stats

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
%matplotlib inline

2. Create the Sample Dataset

For this exercise, you will generate a dataset with 2,000 samples (num_samples = int), 3 cluster centers (num_k_clusters = int), and a cluster standard deviation of 1.5 (k_cluster_std = float). Use a random state (random_state = int) of 42 to reproduce the results in the solution.

We can use scikit-learn's make_blobs function to create a dataset made up of Gaussian distributions.

In [2]:

#TO DO - Create the parameters for your sample dataset

num_samples = 2000    # Number of samples in the dataset
num_k_clusters = 3    # Number of cluster centers to generate
k_cluster_std = 1.5   # Standard deviation of the points within each cluster
random_state = 42     # Random seed for reproducibility


#TO DO - Create the sample dataset using make_blobs() - see the link above

X, y = make_blobs(n_samples=num_samples,
                  centers=num_k_clusters,
                  cluster_std=k_cluster_std,
                  random_state=random_state)

In [3]:

#Check the size of your sample set
print(X.shape, y.shape)

(2000, 2) (2000,)
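
As a quick sanity check (a minimal sketch, not part of the graded exercise), the true labels returned by make_blobs can be counted to see how the samples are spread across the generated blobs. The names X_demo, y_demo and blob_sizes are illustrative only and are chosen so they do not overwrite the exercise variables.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same parameters as the exercise, stored under separate demo names
X_demo, y_demo = make_blobs(n_samples=2000, centers=3,
                            cluster_std=1.5, random_state=42)

# Count how many samples were generated for each blob
blob_sizes = np.bincount(y_demo)
print("samples per blob:", blob_sizes)

# Empirical center of each blob: the per-feature mean of its samples
for label in np.unique(y_demo):
    print("blob", label, "mean:", X_demo[y_demo == label].mean(axis=0).round(2))
```

The labels in y_demo are only used here for inspection; the clustering models in the next sections never see them.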

3. Create a list of cluster values (k) to evaluate

Using the variable range_k_clusters, create a list of ints holding the numbers of clusters you will analyze, from 2 to 6.

In [4]:

#TO DO
range_k_clusters = [2, 3, 4, 5, 6] #Create list of ints

4. Cluster Dataset Using KMeans and GMM

Initialize, fit and predict your K-Means and GMM clustering models by iterating through the list you created above. Remember to set the random_state in each model to the variable random_state above.

Initialize, fit and predict K-Means clusters for all the k-clusters in range_k_clusters

To review the documentation on scikit-learn's KMeans CLICK HERE

In [5]:

kmeans_cluster_labels = []  # Holds the K-Means cluster predictions for each k
kmeans_clusterer_list = []  # Holds the K-Means cluster centers for each k (used for plotting)

# Initialize one clusterer per k_clusters value in range_k_clusters
for k_clusters in range_k_clusters:
    
    #TO DO
    # Initialize KMeans cluster using n_clusters=k_clusters and random_state=random_state
    kmeans_clusterer = KMeans(n_clusters=k_clusters, 
                              random_state=random_state)
    
    #TO DO
    # Fit dataset (X) to kmeans_clusterer
    kmeans_model = kmeans_clusterer.fit(X)
    
    #TO DO
    # Predict new dataset (X) clusters using the kmeans_model and predict()
    kmeans_prediction = kmeans_model.predict(X)
    
    #Add predictions to list called kmeans_cluster_labels
    kmeans_cluster_labels.append(kmeans_prediction)  
    
    #Add values to use to determine cluster centers for plotting
    kmeans_clusterer_list.append(kmeans_clusterer.cluster_centers_)
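
The fit-then-predict pattern above can also be written with KMeans.fit_predict(), and each fitted model exposes its within-cluster sum of squares as inertia_. The sketch below is illustrative only: the demo_ names are assumptions made so the exercise variables stay untouched, and n_init=10 is passed explicitly because the default for that parameter differs between scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=2000, centers=3,
                       cluster_std=1.5, random_state=42)

demo_inertias = {}
for k in [2, 3, 4, 5, 6]:
    demo_clusterer = KMeans(n_clusters=k, n_init=10, random_state=42)
    # fit_predict() fits the model and returns the training labels in one call
    demo_labels = demo_clusterer.fit_predict(X_demo)
    # inertia_ is the sum of squared distances from samples to their closest center
    demo_inertias[k] = demo_clusterer.inertia_
    print(k, demo_clusterer.cluster_centers_.shape, round(demo_inertias[k], 1))
```

Inertia typically keeps shrinking as k grows, so it cannot pick the number of clusters by itself; that is one reason this exercise scores the models with the Silhouette Coefficient instead.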

Initialize, fit and predict GMM clustering for all the k-clusters in range_k_clusters

To review the documentation on scikit-learn's GaussianMixture CLICK HERE

In [6]:

gmm_cluster_labels = []      # Holds the GMM cluster predictions for each k
gmm_clusterer_list = []      # Holds placeholder arrays, filled with cluster centers at plot time
gmm_clusterer_gen_list = []  # Holds the fitted GMM models (means and covariances are used for plotting)


# Initialize one clusterer per k_clusters value in range_k_clusters

for k_clusters in range_k_clusters:

    #TO DO
    # Initialize Gaussian Mixture Model cluster using n_components=k_clusters and random_state=random_state
    gmm_clusterer = GaussianMixture(n_components=k_clusters, 
                                    random_state=random_state)
    
    #TO DO
    # Fit dataset (X) to gmm_clusterer
    gmm_model = gmm_clusterer.fit(X)
    
    #TO DO
    # Predict new dataset (X) clusters using the gmm_model and predict()
    gmm_prediction = gmm_model.predict(X)
    
    #Add predictions to list called gmm_cluster_labels
    gmm_cluster_labels.append(gmm_prediction) 
    
    #Add values used to determine center of clusters when plotting
    gmm_clusterer_list.append(np.empty(shape=(gmm_clusterer.n_components, X.shape[1])))
    gmm_clusterer_gen_list.append(gmm_clusterer)
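
Unlike K-Means, a fitted GaussianMixture can also report how strongly each sample belongs to every component. The sketch below, which assumes the same sample data under illustrative demo_ names, compares the hard labels from predict() with the soft assignments from predict_proba().

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=2000, centers=3,
                       cluster_std=1.5, random_state=42)

demo_gmm = GaussianMixture(n_components=3, random_state=42).fit(X_demo)

hard_labels = demo_gmm.predict(X_demo)        # one component index per sample
soft_labels = demo_gmm.predict_proba(X_demo)  # shape (n_samples, n_components)

# The five samples whose strongest membership is weakest, i.e. the most ambiguous ones
least_certain = np.argsort(soft_labels.max(axis=1))[:5]
print(soft_labels[least_certain].round(3))

# Fitted component means, one row per component
print(demo_gmm.means_.round(2))
```

Each row of soft_labels sums to 1, and the hard label is the component with the largest membership probability.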

5. Create Silhouette Coefficient Scores for your K-Means and GMM Clusters

The following code zips the list of cluster counts (range_k_clusters) with the lists of model predictions (kmeans_cluster_labels, gmm_cluster_labels). It then iterates through them so that, for each number of clusters (k_clusters), the dataset (X) is scored against the matching K-Means and GMM cluster labels (kmeans_cluster_label, gmm_cluster_label).

In this section, compute the silhouette score for K-Means (silhouette_avg_kmeans) and GMM (silhouette_avg_gmm) using silhouette_score().

To review the documentation on scikit-learn's silhouette_score() CLICK HERE
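
For a single sample i, the Silhouette Coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. silhouette_score() returns the mean of s(i) over all samples. The sketch below computes this by hand on a tiny made-up dataset and compares it with scikit-learn; the function manual_silhouette_samples and the arrays X_tiny and labels_tiny are illustrative names, not part of the exercise.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

def manual_silhouette_samples(points, labels):
    """Per-sample silhouette coefficients computed directly from the definition."""
    distances = pairwise_distances(points)  # Euclidean distance by default
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same_cluster = labels == labels[i]
        cluster_size = same_cluster.sum()
        if cluster_size == 1:
            continue  # a single-member cluster is scored 0 by convention
        # a(i): mean distance to the other members of the same cluster
        a = distances[i, same_cluster].sum() / (cluster_size - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(distances[i, labels == other].mean()
                for other in cluster_ids if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Three well-separated pairs of points, made up for illustration
X_tiny = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0],
                   [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]])
labels_tiny = np.array([0, 0, 1, 1, 2, 2])

manual_scores = manual_silhouette_samples(X_tiny, labels_tiny)
print(manual_scores.round(3))
print(np.allclose(manual_scores, silhouette_samples(X_tiny, labels_tiny)))
print(np.isclose(manual_scores.mean(), silhouette_score(X_tiny, labels_tiny)))
```

Scores close to 1 indicate samples that sit well inside their cluster, scores near 0 indicate samples on a boundary, and negative scores suggest a sample may be closer to a neighboring cluster.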

In [7]:

# Silhouette Score gives the average value for all the samples.
for k_clusters, kmeans_cluster_label, gmm_cluster_label in \
    zip(range_k_clusters, kmeans_cluster_labels, gmm_cluster_labels):
    
    #TO DO
    #Create the Silhouette Score for X and the KMeans cluster (kmeans_cluster_label)
    silhouette_avg_kmeans = silhouette_score(X, kmeans_cluster_label)

    #Create the Silhouette Score for X and the GMM cluster (gmm_cluster_label)
    silhouette_avg_gmm = silhouette_score(X, gmm_cluster_label)
    
    print("For k_clusters =", k_clusters,
          "The avg silhouette_score using K-Means vs GMM is:", silhouette_avg_kmeans, "vs ", silhouette_avg_gmm,
          "\n")

For k_clusters = 2 The avg silhouette_score using K-Means vs GMM is: 0.667709706418 vs  0.667709706418 

For k_clusters = 3 The avg silhouette_score using K-Means vs GMM is: 0.762057144271 vs  0.762057144271 

For k_clusters = 4 The avg silhouette_score using K-Means vs GMM is: 0.591360992574 vs  0.613452352497 

For k_clusters = 5 The avg silhouette_score using K-Means vs GMM is: 0.445181187607 vs  0.451278159128 

For k_clusters = 6 The avg silhouette_score using K-Means vs GMM is: 0.312978424334 vs  0.432441725452
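
In the output above, both models reach their highest score at k = 3, which matches the 3 centers used to generate the data. Rather than reading the best value off the printout, the scores can be stored and the best k selected programmatically. The following is a minimal sketch under illustrative names (X_demo, demo_scores); n_init=10 is an explicit choice, and exact scores may differ from the printed ones depending on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=2000, centers=3,
                       cluster_std=1.5, random_state=42)

# demo_scores[model_name][k] -> average silhouette score
demo_scores = {"K-Means": {}, "GMM": {}}
for k in [2, 3, 4, 5, 6]:
    kmeans_labels = KMeans(n_clusters=k, n_init=10,
                           random_state=42).fit_predict(X_demo)
    gmm_labels = GaussianMixture(n_components=k,
                                 random_state=42).fit_predict(X_demo)
    demo_scores["K-Means"][k] = silhouette_score(X_demo, kmeans_labels)
    demo_scores["GMM"][k] = silhouette_score(X_demo, gmm_labels)

# Pick the k with the highest average silhouette score for each model
for model_name, scores_by_k in demo_scores.items():
    best_k = max(scores_by_k, key=scores_by_k.get)
    print(f"{model_name}: best k = {best_k}, "
          f"silhouette = {scores_by_k[best_k]:.3f}")
```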

6. Run the cell below for scatter plot visualizations of the K-Means and GMM clusters calculated above.

In [8]:

# Visualize and compare KMeans vs GMM Clusters for each value of k_clusters
# Variables centers1 and centers2 are used to plot the centers of each cluster
for k_clusters, kmeans_cluster_label, gmm_cluster_label, centers1, centers2, gmm_cluster in zip(
        range_k_clusters, kmeans_cluster_labels, gmm_cluster_labels,
        kmeans_clusterer_list, gmm_clusterer_list, gmm_clusterer_gen_list):

    # Plot showing the actual clusters formed
    # Create a figure with 1 row and 2 columns: K-Means on the left, GMM on the right
    fig, [ax2, ax1] = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    
    #Plots for kmeans
    colors = cm.nipy_spectral(kmeans_cluster_label.astype(float) / k_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')

    # Labeling the clusters
    for i, c in enumerate(centers1):
        ax2.scatter(c[0], c[1], marker="o", alpha=1, s=50, color='r')

    ax2.set_title("The visualization of the K-Means clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    # plots for GMM
    colors = cm.nipy_spectral(gmm_cluster_label.astype(float) / k_clusters)
    ax1.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    
        
    # Mark each GMM cluster center at the sample with the highest density under that component
    for i in range(k_clusters):
        density = scipy.stats.multivariate_normal(cov=gmm_cluster.covariances_[i], 
                                                  mean=gmm_cluster.means_[i]).logpdf(X)
        centers2[i, :] = X[np.argmax(density)]
    ax1.scatter(centers2[:, 0], centers2[:, 1], marker="o", alpha=1, s=50, color='r')
    

    ax1.set_title("The visualization of the GMM clustered data.")
    ax1.set_xlabel("Feature space for the 1st feature")
    ax1.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Scatter Plots of KMeans and GMM clustering on sample data "
                  "with k_clusters = %d" % k_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
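
The scatter plots show where the clusters lie, but not how well each sample fits its cluster. One common complement is a silhouette plot built from silhouette_samples(), which is imported in step 1 but not otherwise used in this solution. The sketch below is an illustrative addition, not part of the original exercise: it assumes K-Means with k = 3 on the same sample data, uses demo names throughout, and selects the non-interactive Agg backend so it also runs outside a notebook (drop that line when using %matplotlib inline).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_demo, _ = make_blobs(n_samples=2000, centers=3,
                       cluster_std=1.5, random_state=42)
k_demo = 3
labels_demo = KMeans(n_clusters=k_demo, n_init=10,
                     random_state=42).fit_predict(X_demo)

# One silhouette coefficient per sample, each in the range [-1, 1]
sample_scores = silhouette_samples(X_demo, labels_demo)

fig, ax = plt.subplots(figsize=(8, 5))
y_lower = 10
for cluster in range(k_demo):
    # Sort the scores of this cluster so they form a smooth band
    cluster_scores = np.sort(sample_scores[labels_demo == cluster])
    y_upper = y_lower + cluster_scores.size
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_scores,
                     color=cm.nipy_spectral(cluster / k_demo), alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * cluster_scores.size, str(cluster))
    y_lower = y_upper + 10  # leave a gap between the bands

# Dashed line at the average score, the value silhouette_score() would return
ax.axvline(x=sample_scores.mean(), color="red", linestyle="--")
ax.set_title("Silhouette plot for K-Means with k_clusters = %d" % k_demo)
ax.set_xlabel("Silhouette coefficient")
ax.set_ylabel("Samples, grouped by cluster")
ax.set_yticks([])
plt.show()
```

Bands of similar width that mostly extend past the dashed average line suggest evenly sized, well-separated clusters.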
