
PCA - An example on Exploratory Data Analysis

In this notebook you will:

  • Replicate Andrew's example on PCA
  • Visualize how PCA works on a small 2-dimensional dataset, and see that not every projection is a "good" one
  • Visualize how 3-dimensional data can be contained in a 2-dimensional subspace
  • Use PCA to find hidden patterns in a high-dimensional dataset

Importing the libraries

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from pca_utils import plot_widget
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import plotly.offline as py

py.init_notebook_mode()

output_notebook()

Lecture Example

We are going to work on the same example that Andrew showed in the lecture.

X = np.array([[ 99,  -1],
       [ 98,  -1],
       [ 97,  -2],
       [101,   1],
       [102,   1],
       [103,   2]])

plt.plot(X[:,0], X[:,1], 'ro')

# Loading the PCA algorithm
pca_2 = PCA(n_components=2)
pca_2

# Let's fit the data. sklearn's PCA centers the data (subtracts the mean) automatically;
# for this small example we do not standardize the features further.
pca_2.fit(X)

pca_2.explained_variance_ratio_

The coordinates on the first principal component (first axis) are enough to retain 99.24% of the information ("explained variance"). The second principal component adds an additional 0.76% of the information ("explained variance") that is not stored in the first principal component coordinates.
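
These ratios can also be recomputed by hand from the variance of the projected coordinates; a minimal sketch, assuming pca_2 and X as defined above:

import numpy as np

proj = pca_2.transform(X)                   # coordinates on the new axes
var_per_axis = proj.var(axis=0, ddof=1)     # sample variance along each principal component
print(var_per_axis / var_per_axis.sum())    # approximately [0.9924, 0.0076]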

X_trans_2 = pca_2.transform(X)
X_trans_2

Think of column 1 as the coordinate along the first principal component (the first new axis) and column 2 as the coordinate along the second principal component (the second new axis).

You can probably just choose the first principal component since it retains 99% of the information (explained variance).
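
Under the hood these coordinates are simply the mean-centered data projected onto the principal directions; a small sketch of that relationship, assuming pca_2 and X from above (and the default whiten=False):

import numpy as np

# Rows of components_ are the principal directions; transform() projects the centered data onto them.
manual_coords = (X - pca_2.mean_) @ pca_2.components_.T
print(np.allclose(manual_coords, pca_2.transform(X)))  # True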

pca_1 = PCA(n_components=1)
pca_1

pca_1.fit(X)
pca_1.explained_variance_ratio_

X_trans_1 = pca_1.transform(X)
X_trans_1

Notice how this column is just the first column of X_trans_2.

If you have 2 features (two columns of data) and keep 2 principal components, then no information is lost: the original data can be reconstructed exactly from the new coordinates.
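
A quick way to confirm this, as a sketch assuming pca_2 and X from above:

import numpy as np

# With all components kept, the transform/inverse_transform round trip recovers X exactly
# (up to floating-point error).
print(np.allclose(pca_2.inverse_transform(pca_2.transform(X)), X))  # True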

X_reduced_2 = pca_2.inverse_transform(X_trans_2)
X_reduced_2

plt.plot(X_reduced_2[:,0], X_reduced_2[:,1], 'ro')

Reduce to 1 dimension instead of 2

X_reduced_1 = pca_1.inverse_transform(X_trans_1)
X_reduced_1

plt.plot(X_reduced_1[:,0], X_reduced_1[:,1], 'ro')

Notice how the data now lie on a single line. This line is the single principal component used to describe the data, and each example has a single "coordinate" along that axis describing its location.
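
One way to quantify what the 1-dimensional compression discarded is the reconstruction error; a quick sketch, assuming X and X_reduced_1 from above:

import numpy as np

# Mean squared distance between each original point and its projection onto the line.
mse = np.mean(np.sum((X - X_reduced_1) ** 2, axis=1))
print(mse)  # small compared with the spread of the data along the first component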

Visualizing the PCA algorithm

Let's define $10$ points in the plane and use them as an example to visualize how we can compress these points into 1 dimension. You will see that there are good ways and bad ways to do it.

X = np.array([[-0.83934975, -0.21160323],
       [ 0.67508491,  0.25113527],
       [-0.05495253,  0.36339613],
       [-0.57524042,  0.24450324],
       [ 0.58468572,  0.95337657],
       [ 0.5663363 ,  0.07555096],
       [-0.50228538, -0.65749982],
       [-0.14075593,  0.02713815],
       [ 0.2587186 , -0.26890678],
       [ 0.02775847, -0.77709049]])

p = figure(title = '10-point scatterplot', x_axis_label = 'x-axis', y_axis_label = 'y-axis') ## Creates the figure object
p.scatter(X[:,0],X[:,1],marker = 'o', color = '#C00000', size = 5) ## Add the scatter plot

## Some visual adjustments
p.grid.visible = False
p.outline_line_color = None
p.toolbar.logo = None
p.toolbar_location = None
p.xaxis.axis_line_color = "#f0f0f0"
p.xaxis.axis_line_width = 5
p.yaxis.axis_line_color = "#f0f0f0"
p.yaxis.axis_line_width = 5

## Shows the figure
show(p)

The next code cell generates a widget where you can see how different ways of compressing this data into 1-dimensional points lead to different spreads of the points in the new space. The line found by PCA is the one that keeps the projected points as spread out as possible, i.e., the direction of maximum variance.

You can use the slider to rotate the black line around its center and see how the points' projections onto the line change as it rotates.

Notice that some projections place different points almost on top of each other, while others keep the points nearly as separated as they were in the plane.
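
The same idea can be checked numerically: project the points onto a unit vector at an angle theta and compare the variance of the projections with the variance along the direction PCA picks. A rough sketch, assuming X is the 10-point array defined above:

import numpy as np
from sklearn.decomposition import PCA

def projection_variance(X, theta):
    # Variance of the data projected onto the unit vector at angle theta.
    direction = np.array([np.cos(theta), np.sin(theta)])
    return ((X - X.mean(axis=0)) @ direction).var()

for theta in [0.0, np.pi / 4, np.pi / 2]:                 # a few arbitrary directions
    print(f"theta={theta:.2f}: variance={projection_variance(X, theta):.4f}")

pca_dir = PCA(n_components=1).fit(X).components_[0]       # first principal direction
pca_theta = np.arctan2(pca_dir[1], pca_dir[0])
print(f"PCA direction: variance={projection_variance(X, pca_theta):.4f}")  # the largest of all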

plot_widget()

Visualization of a 3-dimensional dataset

In this section we will see how 3-dimensional data can be condensed into a 2-dimensional space.

from pca_utils import random_point_circle, plot_3d_2d_graphs

X = random_point_circle(n = 150)

deb = plot_3d_2d_graphs(X)

deb.update_layout(yaxis2 = dict(title_text = 'test', visible=True))
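
You can also check numerically how much of the 3-dimensional spread a 2-dimensional projection retains; a small sketch, assuming random_point_circle returns an (n, 3) array of points:

from sklearn.decomposition import PCA

# Fit a 2-component PCA on the 3-D points and check how much variance the plane retains.
pca_plane = PCA(n_components=2).fit(X)
print(pca_plane.explained_variance_ratio_.sum())  # close to 1 if the points lie near a plane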

Using PCA in Exploratory Data Analysis

Let's load a toy dataset with $500$ samples and $1000$ features.

df = pd.read_csv("toy_dataset.csv")

df.head()

This is a dataset with $1000$ features.

Let's see if there is a pattern in the data. The following function randomly samples 100 distinct pairs (x, y) of features so we can scatter-plot them against each other.

def get_pairs(n = 100):
    from random import randint
    i = 0
    tuples = []
    while i < n:
        x = df.columns[randint(0,999)]
        y = df.columns[randint(0,999)]
        while x == y or (x,y) in tuples or (y,x) in tuples:
            y = df.columns[randint(0,999)]
        tuples.append((x,y))
        i+=1
    return tuples

pairs = get_pairs()

Now let's plot them!

fig, axs = plt.subplots(10,10, figsize = (35,35))
i = 0
for rows in axs:
    for ax in rows:
        ax.scatter(df[pairs[i][0]],df[pairs[i][1]], color = "#C00000")
        ax.set_xlabel(pairs[i][0])
        ax.set_ylabel(pairs[i][1])
        i+=1

It looks like there is not much information hidden in these pairwise plots. It is also not feasible to check every combination, given the number of features. Let's look at the linear correlation between the features instead.

# This may take about a minute to run
corr = df.corr()

## Show all feature pairs whose correlation is > 0.5 in absolute value. We exclude
## entries with correlation == 1 to drop each feature's correlation with itself.

mask = (abs(corr) > 0.5) & (abs(corr) != 1)
corr.where(mask).stack().sort_values()

The strongest correlations are only around $0.631$-$0.632$ in absolute value, so this does not reveal much either.

Let's try a PCA decomposition to compress our data into a 2-dimensional subspace (a plane) so we can plot it as a scatter plot.

# Loading the PCA object
pca = PCA(n_components = 2) # Here we choose the number of components that we will keep.
X_pca = pca.fit_transform(df)
df_pca = pd.DataFrame(X_pca, columns = ['principal_component_1','principal_component_2'])

df_pca.head()

plt.scatter(df_pca['principal_component_1'],df_pca['principal_component_2'], color = "#C00000")
plt.xlabel('principal_component_1')
plt.ylabel('principal_component_2')
plt.title('PCA decomposition')

This is great! We can see well-defined clusters.

# pca.explained_variance_ratio_ returns the fraction of the total variance explained by each principal component.
sum(pca.explained_variance_ratio_)

And we preserved only around 14.6% of the variance!

Quite impressive! We can clearly see clusters in our data, something that we could not see before. How many clusters can you spot? 8, 10?
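
If you are curious how many components it would take to preserve most of the variance, here is a quick scree-style check, sketched under the assumption that df is the toy dataset loaded above:

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and find how many are needed to reach 90% of the variance.
pca_full = PCA().fit(df)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(int(np.argmax(cumulative >= 0.9)) + 1)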

If we run PCA with 3 components and plot in 3 dimensions, we will retain more information from the data.

pca_3 = PCA(n_components = 3).fit(df)
X_t = pca_3.transform(df)
df_pca_3 = pd.DataFrame(X_t,columns = ['principal_component_1','principal_component_2','principal_component_3'])

import plotly.express as px

fig = px.scatter_3d(df_pca_3, x = 'principal_component_1', y = 'principal_component_2', z = 'principal_component_3').update_traces(marker = dict(color = "#C00000"))
fig.show()

sum(pca_3.explained_variance_ratio_)

Now we have preserved around 19% of the variance and we can clearly see 10 clusters.
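
If you want to follow up on that visual impression, one option is to run a quick clustering on the 3-component projection. This is only an illustrative sketch (KMeans is not part of this notebook), assuming df_pca_3 from above:

import pandas as pd
from sklearn.cluster import KMeans

# Cluster the 3-component projection into 10 groups and inspect the cluster sizes.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(df_pca_3)
print(pd.Series(labels).value_counts())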

Congratulations on finishing this notebook!

