Feature Scaling - Solution
With any machine learning model that is sensitive to the scale of its inputs (regularized regression methods, neural networks, and now k-means), you will want to scale your data. If you have features that are on completely different scales, this can greatly impact the clusters you get when using k-means.
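To see why, here is a minimal sketch with toy numbers (not the notebook's data): when one feature lives on a much larger scale, it dominates the Euclidean distances that k-means relies on.

import numpy as np

a = np.array([600.0, 50.0])   # [height, weight] on very different raw scales
b = np.array([601.0, 80.0])   # nearly the same height as a, very different weight
c = np.array([700.0, 50.0])   # very different height, the same weight

print(np.linalg.norm(a - b))  # ~30.0
print(np.linalg.norm(a - c))  # 100.0 -- raw distance calls c the outlier,
                              # purely because height has bigger numbers

scale = np.array([100.0, 10.0])         # rough per-feature spreads (assumed values)
print(np.linalg.norm((a - b) / scale))  # ~3.0
print(np.linalg.norm((a - c) / scale))  # 1.0 -- after scaling, b is the outlier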
In this notebook, you will get to see this first hand. To begin, let's read in the necessary libraries.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing as p
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
import helpers2 as h  # local helper module used to simulate the dataset
import tests as t     # local module with checks for the notebook's solutions
# Create the dataset for the notebook: 200 points with 2 features,
# drawn from 4 groups (per the arguments to the helper)
data = h.simulate_data(200, 2, 4)
df = pd.DataFrame(data)
df.columns = ['height', 'weight']
df['height'] = np.abs(df['height']*100)                      # put height on a much larger scale
df['weight'] = df['weight'] + np.random.normal(50, 10, 200)  # shift weight into a realistic range
1. Next, take a look at the data to get familiar with it. The dataset has two columns, and it is stored in the df variable. It might be useful to get an idea of the spread in the current data, as well as a visual of the points.
df.describe()
|  | height | weight |
|---|---|---|
| count | 200.000000 | 200.000000 |
| mean | 569.726207 | 53.292113 |
| std | 246.966215 | 11.628615 |
| min | 92.998481 | 16.397952 |
| 25% | 357.542793 | 46.209965 |
| 50% | 545.766752 | 53.742307 |
| 75% | 773.310607 | 61.947976 |
| max | 1096.222348 | 82.837302 |
plt.scatter(df['height'], df['weight']);  # visualize the raw, unscaled data
Now that we've got a dataset, let's look at some options for scaling it. There are two very common types of feature scaling that we should discuss:
I. MinMaxScaler
In some cases it is useful to think of each value in terms of how far it sits between the feature's minimum and maximum, i.e. (x - min) / (max - min), which maps every feature onto [0, 1]. In these cases, you will want to use MinMaxScaler.
II. StandardScaler
Another very popular type of scaling is to standardize the data so that each feature has mean 0 and variance 1, i.e. (x - mean) / std. In these cases, you will want to use StandardScaler.
It is probably more appropriate to use StandardScaler with this data. However, to get practice with feature scaling methods in Python, we will perform both; a quick sanity check of the two formulas appears below.
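As that sanity check, here is a minimal sketch on a toy one-column array, comparing each scaler's output to the formula computed by hand:

import numpy as np
from sklearn import preprocessing as p

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# MinMaxScaler: (x - min) / (max - min) -> values in [0, 1]
print(p.MinMaxScaler().fit_transform(x).ravel())      # [0.    0.333 0.667 1.   ]
print(((x - x.min()) / (x.max() - x.min())).ravel())  # same values by hand

# StandardScaler: (x - mean) / std -> mean 0, variance 1
print(p.StandardScaler().fit_transform(x).ravel())    # [-1.342 -0.447  0.447  1.342]
print(((x - x.mean()) / x.std()).ravel())             # same values by hand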
2. First let's fit the StandardScaler transformation to this dataset. I will do this one so you can see how to apply preprocessing in sklearn.
df_ss = p.StandardScaler().fit_transform(df)    # fit the scaler and transform the data
df_ss = pd.DataFrame(df_ss)                     # put the result back into a dataframe
df_ss.columns = ['height', 'weight']            # restore the column names
plt.scatter(df_ss['height'], df_ss['weight']);  # plot the standardized data
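One caveat worth knowing: fit_transform both learns the scaling parameters (the mean and standard deviation here) and applies them. If new data arrives later, you reuse the already-fitted scaler instead of refitting it. A minimal sketch, where new_points is a hypothetical array with the same two columns:

scaler = p.StandardScaler().fit(df)  # learn mean/std from this data only
df_ss_alt = pd.DataFrame(scaler.transform(df), columns=['height', 'weight'])

# new_points is hypothetical -- reuse the fitted mean/std rather than refitting:
# new_scaled = scaler.transform(new_points)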
3. Now it's your turn. Try fitting the MinMaxScaler transformation to this dataset. You should be able to use the previous example as a guide.
df_mm = p.MinMaxScaler().fit_transform(df)      # fit the scaler and transform the data
df_mm = pd.DataFrame(df_mm)                     # put the result back into a dataframe
df_mm.columns = ['height', 'weight']            # restore the column names
plt.scatter(df_mm['height'], df_mm['weight']);  # plot the min-max scaled data
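Both scalers can also map scaled values back to the original units with inverse_transform, which is handy if you later want to report cluster centers in raw heights and weights. A minimal sketch:

mm_scaler = p.MinMaxScaler().fit(df)
scaled = mm_scaler.transform(df)
restored = mm_scaler.inverse_transform(scaled)  # back to the raw units
print(np.allclose(restored, df.values))         # True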
4. Now let's take a look at how k-means divides the dataset into different groups for each of the different scalings of the data. Do you end up with different clusters when the data is scaled differently?
def fit_kmeans(data, centers):
    '''
    INPUT:
        data - the dataset you would like to fit kmeans to (dataframe)
        centers - the number of centroids (int)
    OUTPUT:
        labels - the cluster label for each datapoint (nparray)
    '''
    kmeans = KMeans(n_clusters=centers)  # instantiate the model
    labels = kmeans.fit_predict(data)    # fit to the data and assign each point a cluster
    return labels
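Note that KMeans initializes its centroids randomly, so the labels (and sometimes the clusters themselves) can vary between runs. If you want reproducible results, a minimal variant fixes the seed (the value 42 is arbitrary):

def fit_kmeans_seeded(data, centers, seed=42):
    '''Same as fit_kmeans, but with a fixed random initialization.'''
    kmeans = KMeans(n_clusters=centers, random_state=seed)
    return kmeans.fit_predict(data)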
labels = fit_kmeans(df, 10) #fit kmeans to get the labels
# Plot the original data with clusters
plt.scatter(df['height'], df['weight'], c=labels, cmap='Set1');
labels = fit_kmeans(df_mm, 10)  # refit kmeans to get labels for the min-max scaled data

# Plot each of the scaled datasets with its own clustering
plt.scatter(df_mm['height'], df_mm['weight'], c=labels, cmap='Set1');

labels = fit_kmeans(df_ss, 10)  # refit kmeans on the standardized data
plt.scatter(df_ss['height'], df_ss['weight'], c=labels, cmap='Set1');
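Rather than only eyeballing the plots, you can quantify how much two clusterings agree. One option (not part of the original notebook) is sklearn's adjusted Rand index, which compares two label assignments regardless of how the cluster numbers are permuted; a score of 1.0 means identical groupings. A minimal sketch:

from sklearn.metrics import adjusted_rand_score

labels_raw = fit_kmeans(df, 10)
labels_mm = fit_kmeans(df_mm, 10)
labels_ss = fit_kmeans(df_ss, 10)

# Scores near 1 mean two scalings produced essentially the same grouping
print(adjusted_rand_score(labels_raw, labels_ss))
print(adjusted_rand_score(labels_mm, labels_ss))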
This differs from what was stated in the video: in this case, the scaling did end up changing the results. In the video, the k-means algorithm was not refit to each differently scaled dataset; a single clustering, fit once, was reused for every dataset. In this notebook, the clustering was recomputed for each scaling, which changes the results!