Changing K - Solution

In this notebook, you will get some practice with different values of k and see how they change the clusters observed in the data, as well as how to determine what the best value of k might be for a dataset.

To get started, let's read in our necessary libraries.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import helpers2 as h
import tests as t
from IPython import display

%matplotlib inline

# Make the images larger
plt.rcParams['figure.figsize'] = (16, 9)

1. To get started, there is a function called simulate_data within the helpers2 module. Read the documentation on the function by running the cell below. Then use the function to simulate a dataset with 200 data points (rows), 5 features (columns), and 4 centers.

In [2]:
h.simulate_data?
In [3]:
data = h.simulate_data(200, 5, 4)

# This will check that your dataset appears to match ours before moving forward
t.test_question_1(data)
Looks good!  Continue!
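If you don't have the course's helpers2 module, a dataset of the same shape can be simulated directly with make_blobs (already imported above). This is only a stand-in sketch: the exact points will differ from the course data, so t.test_question_1 may not pass on it.

# Hypothetical stand-in for h.simulate_data: 200 rows, 5 features, 4 centers.
# make_blobs also returns the true labels, which we discard here.
data_alt, _ = make_blobs(n_samples=200, n_features=5, centers=4, random_state=42)
print(data_alt.shape)  # (200, 5)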

2. Because of how you set up the data, what should the value of k be?

In [4]:
k_value = 4

# Check your solution against ours.
t.test_question_2(k_value)
That's right!  The value of k is the same as the number of centroids used to create your dataset.

3. Let's try a few different values for k and fit them to our data using KMeans.

To use KMeans, you need to follow three steps:

I. Instantiate your model.

II. Fit your model to the data.

III. Predict the labels for the data.

In [5]:
# Try instantiating a model with 4 centers
kmeans_4 = KMeans(n_clusters=4)

# Then fit the model to your data using the fit method
model_4 = kmeans_4.fit(data)

# Finally predict the labels on the same data to show the category that point belongs to
labels_4 = model_4.predict(data)

# If you did all of that correctly, this should provide a plot of your data colored by center
h.plot_data(data, labels_4)
[Plot: the simulated data colored by the 4 predicted cluster labels]
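h.plot_data is another course helper whose source isn't shown here. A minimal stand-in, assuming data is array-like, is to scatter the first two of the five features colored by the predicted labels; it's a rough view, since two dimensions may not fully separate the blobs.

# Hypothetical substitute for h.plot_data: project onto the first two features
X = np.asarray(data)
plt.scatter(X[:, 0], X[:, 1], c=labels_4, s=30)
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.title('KMeans labels (k=4), first two of five features')
plt.show()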

4. Now try again, but this time fit kmeans to your data using 2 clusters instead of 4.

In [6]:
# Try instantiating a model with 2 centers
kmeans_2 = KMeans(n_clusters=2)

# Then fit the model to your data using the fit method
model_2 = kmeans_2.fit(data)

# Finally predict the labels on the same data to show the category that point belongs to
labels_2 = model_2.predict(data)

# If you did all of that correctly, this should provide a plot of your data colored by center
h.plot_data(data, labels_2)
[Plot: the simulated data colored by the 2 predicted cluster labels]

5. Now try one more time, but with the number of clusters in kmeans set to 7.

In [7]:
# Try instantiating a model with 7 centers
kmeans_7 = KMeans(n_clusters=7)

# Then fit the model to your data using the fit method
model_7 = kmeans_7.fit(data)

# Finally predict the labels on the same data to show the category that point belongs to
labels_7 = model_7.predict(data)

# If you did all of that correctly, this should provide a plot of your data colored by center
h.plot_data(data, labels_7)
[Plot: the simulated data colored by the 7 predicted cluster labels]

Visually, we get some indication of how well our model is doing, but it isn't totally apparent. Each time additional centers are considered, the distances between the points and their centers will decrease. However, at some point, that decrease is not substantial enough to suggest the need for an additional cluster.

Using a scree plot is a common way to judge whether an additional cluster center is needed. The elbow method, which reads the number of clusters off a scree plot, is still fairly subjective, but let's take a look to see how many cluster centers it suggests.
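You can verify that monotone decrease directly: a fitted sklearn KMeans exposes inertia_, the sum of squared distances from each point to its assigned center, and model.score(data) returns the negative of that same quantity on the training data.

# SSE shrinks every time we add centers, regardless of the "true" k
for k, model in [(2, model_2), (4, model_4), (7, model_7)]:
    print(f"k={k}: SSE (inertia_) = {model.inertia_:.1f}")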


6. Once you have fit a kmeans model to some data in sklearn, there is a score method that takes the data. This score is an indication of how far the points are from the centroids. By fitting models with centroid counts from 1 to 10, and keeping track of the score and the number of centroids, you should be able to build a scree plot.

This plot should have the number of centroids on the x-axis and the absolute value of the score on the y-axis. You can see the plot we obtained by running the solution code. Try creating your own scree plot, as you will need it for the final questions.

In [ ]:
# A place for your work - create a scree plot - you will need to
# Fit a kmeans model with changing k from 1-10
# Obtain the score for each model (take the absolute value)
# Plot the score against k

def get_kmeans_score(data, center):
    '''
    returns the kmeans score regarding SSE for points to centers
    INPUT:
        data - the dataset you want to fit kmeans to
        center - the number of centers you want (the k value)
    OUTPUT:
        score - the SSE score for the kmeans model fit to the data
    '''
    # Instantiate kmeans
    kmeans = KMeans(n_clusters=center)

    # Then fit the model to your data using the fit method
    model = kmeans.fit(data)
    
    # Obtain a score related to the model fit
    score = np.abs(model.score(data))
    
    return score

scores = []
centers = list(range(1,11))

for center in centers:
    scores.append(get_kmeans_score(data, center))
    
plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
In [ ]:
# Run our solution
centers, scores = h.fit_mods()

# Your plot should look similar to the one below
plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
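Reading the elbow by eye is subjective. As a rough heuristic (a sketch, not part of the course solution, assuming scores holds the absolute SSE values built above), you can look at how much the SSE drops from each k to the next and pick the k after which the drops level off.

# Successive drops in SSE; the elbow is roughly where the drops become small
drops = -np.diff(scores)  # scores decrease with k, so negate the differences
for k, drop in zip(centers[1:], drops):
    print(f"k={k}: SSE drop versus k-1 = {drop:.1f}")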

7. Using the scree plot, how many clusters would you suggest are in the data? What is K?

In [ ]:
value_for_k = 4

# Test your solution against ours
display.HTML(t.test_question_7(value_for_k))