
Feature Scaling - Solution

With any distance-based or scale-sensitive machine learning model (regularized regression methods, neural networks, and now k-means), you will want to scale your data.

If you have some features that are on completely different scales, this can greatly impact the clusters you get when using K-Means.
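To see why, here is a quick sketch (with hypothetical numbers, not this notebook's dataset): the Euclidean distance that k-means relies on is dominated by whichever feature spans the larger numeric range.

import numpy as np

# Two points that differ by 100 units of height but only 1 unit of weight
a = np.array([500.0, 50.0])   # [height, weight]
b = np.array([600.0, 51.0])

# Unscaled: the height difference dominates the distance almost entirely
print(np.linalg.norm(a - b))            # ~100.005

# Divide each feature by a rough per-feature spread (hypothetical values)
# and both features now contribute comparably
scale = np.array([250.0, 12.0])
print(np.linalg.norm((a - b) / scale))  # ~0.41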

In this notebook, you will get to see this firsthand. To begin, let's read in the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing as p

%matplotlib inline

plt.rcParams['figure.figsize'] = (16, 9)
import helpers2 as h
import tests as t


# Create the dataset for the notebook
data = h.simulate_data(200, 2, 4)
df = pd.DataFrame(data)
df.columns = ['height', 'weight']
df['height'] = np.abs(df['height']*100)
df['weight'] = df['weight'] + np.random.normal(50, 10, 200)

1. Next, take a look at the data to get familiar with it. The dataset has two columns, and it is stored in the df variable. It might be useful to get an idea of the spread in the current data, as well as a visual of the points.

In [2]:
df.describe()
Out[2]:

             height      weight
count    200.000000  200.000000
mean     569.726207   53.292113
std      246.966215   11.628615
min       92.998481   16.397952
25%      357.542793   46.209965
50%      545.766752   53.742307
75%      773.310607   61.947976
max     1096.222348   82.837302
In [3]:
plt.scatter(df['height'], df['weight']);
[Figure: scatter plot of the unscaled data, height vs. weight]

Now that we've got a dataset, let's look at some options for scaling the data. There are two very common types of feature scaling that we should discuss:

I. MinMaxScaler

In some cases it is useful to think of each feature value as a fraction of its range, so that every feature is rescaled to lie between 0 and 1. In these cases, you will want to use MinMaxScaler.
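Under the hood, MinMaxScaler computes (x - x.min()) / (x.max() - x.min()) for each feature, mapping it onto the interval [0, 1]. A minimal sketch of that formula, using a few height values taken from the summary table above:

import numpy as np

x = np.array([92.998, 357.543, 545.767, 773.311, 1096.222])  # sample height values from the summary above

# Min-max scaling: x' = (x - min) / (max - min)
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # 0.0 for the minimum, 1.0 for the maximum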

II. StandardScaler

Another very popular type of scaling is to scale data so that it has mean 0 and variance 1. In these cases, you will want to use StandardScaler.
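StandardScaler instead computes (x - x.mean()) / x.std() for each feature. A minimal sketch on the same sample values:

import numpy as np

x = np.array([92.998, 357.543, 545.767, 773.311, 1096.222])  # sample height values from the summary above

# Standardization: z = (x - mean) / std
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # approximately 0.0 and 1.0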

It is probably more appropriate with this data to use StandardScaler. However, to get practice with feature scaling methods in Python, we will perform both.

2. First let's fit the StandardScaler transformation to this dataset. I will do this one so you can see how to apply preprocessing in sklearn.

In [4]:
df_ss = p.StandardScaler().fit_transform(df) # Fit and transform the data
In [5]:
df_ss = pd.DataFrame(df_ss) #create a dataframe
df_ss.columns = ['height', 'weight'] #add column names again

plt.scatter(df_ss['height'], df_ss['weight']); # create a plot
[Figure: scatter plot of the StandardScaler-transformed data]

3. Now it's your turn. Try fitting the MinMaxScaler transformation to this dataset. You should be able to use the previous example as a guide.

In [6]:
df_mm = p.MinMaxScaler().fit_transform(df) # fit and transform
In [7]:
df_mm = pd.DataFrame(df_mm) #create a dataframe
df_mm.columns = ['height', 'weight'] #change the column names

plt.scatter(df_mm['height'], df_mm['weight']); #plot the data
[Figure: scatter plot of the MinMaxScaler-transformed data]

4. Now let's take a look at how k-means divides the dataset into different groups for each of the different scalings of the data. Did you end up with different clusters when the data was scaled differently?

In [8]:
def fit_kmeans(data, centers):
    '''
    INPUT:
        data - the dataset you would like to fit k-means to (dataframe)
        centers - the number of centroids (int)
    OUTPUT:
        labels - the cluster label assigned to each datapoint (nparray)
    '''
    kmeans = KMeans(n_clusters=centers)
    labels = kmeans.fit_predict(data)
    return labels

labels = fit_kmeans(df, 10)  # fit k-means to the unscaled data

# Plot the original data with clusters
plt.scatter(df['height'], df['weight'], c=labels, cmap='Set1');
[Figure: scatter plot of the unscaled data colored by k-means cluster labels]
In [9]:
labels = fit_kmeans(df_mm, 10)  # fit k-means to the MinMax-scaled data

# Plot the MinMax-scaled data with clusters
plt.scatter(df_mm['height'], df_mm['weight'], c=labels, cmap='Set1');
[Figure: scatter plot of the MinMax-scaled data colored by k-means cluster labels]
In [10]:
labels = fit_kmeans(df_ss, 10)  # fit k-means to the standardized data

# Plot the standardized data with clusters
plt.scatter(df_ss['height'], df_ss['weight'], c=labels, cmap='Set1');
[Figure: scatter plot of the standardized data colored by k-means cluster labels]

Different from what was stated in the video: in this case, the scaling did end up changing the results. In the video, the k-means algorithm was not refit to each differently scaled dataset; a single clustering fit was reused on every dataset. In this notebook, the clustering was recomputed for each scaling, which changes the results!
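If you want to quantify how much the cluster assignments differ across scalings, one option (not part of the original notebook) is to compare the label vectors with a permutation-invariant similarity score such as scikit-learn's adjusted_rand_score:

from sklearn.metrics import adjusted_rand_score

# Refit k-means on each version of the data; labels depend on random initialization,
# so your exact numbers will vary from run to run
labels_raw = fit_kmeans(df, 10)
labels_mm = fit_kmeans(df_mm, 10)
labels_ss = fit_kmeans(df_ss, 10)

# 1.0 means identical groupings (up to relabeling); values near 0 mean little agreement
print(adjusted_rand_score(labels_raw, labels_mm))
print(adjusted_rand_score(labels_raw, labels_ss))
print(adjusted_rand_score(labels_mm, labels_ss))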
