PCA Mini Project - Solution

In the lesson, you saw how you could use PCA to substantially reduce the dimensionality of the handwritten digits dataset. In this mini-project, you will be working with the cars.csv file.

To begin, run the cell below to read in the necessary libraries and the dataset. I also read in the helper functions that you used throughout the lesson, in case you find them helpful in completing this project. Otherwise, you can always create functions of your own!

In [ ]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from helper_functions import do_pca, scree_plot, plot_components, pca_results
from IPython import display
import test_code2 as t

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv('./data/cars.csv')

1. Now your data is stored in df. Use the cells below to take a look at your dataset. At the end of your exploration, use your findings to match the appropriate variable to each key in the dictionary below.

In [ ]:
df.head()
In [ ]:
df.describe()
In [ ]:
df.shape
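
If you want to compute the quantities in question 1 directly rather than reading them off the outputs above, here is a minimal sketch. It assumes the body-type dummy columns are 0/1 encoded and that the minivan and highway-mileage columns are named Minivan and HighwayMPG; check df.columns for the actual names in your copy of cars.csv.

In [ ]:
# Sketch: derive the question-1 quantities from df directly.
# The column names Minivan and HighwayMPG are assumptions -- verify with df.columns.
n_cars, n_features = df.shape
dummy_cols = [col for col in df.columns if set(df[col].unique()) <= {0, 1}]
print(n_cars, n_features, len(dummy_cols))
print(df['Minivan'].mean())    # proportion of minivans (if 0/1 encoded)
print(df['HighwayMPG'].max())  # max highway mpg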
In [ ]:
a = 7
b = 66
c = 387
d = 18
e = 0.23
f = 0.05


solution_1_dict = {
    'The number of cars in the dataset': c,
    'The number of car features in the dataset': d,
    'The number of dummy variables in the dataset': a,
    'The proportion of minivans in the dataset': f,
    'The max highway mpg for any car': b
}
In [ ]:
# Check your solution against ours by running this cell
display.HTML(t.check_question_one(solution_1_dict))

2. There are some particularly nice properties of PCA to keep in mind. Use the dictionary below to match the correct variable as the value for each statement. When you are ready, check your solution against ours by running the following cell.

In [ ]:
a = True
b = False

solution_2_dict = {
    'The components span the directions of maximum variability.': a,
    'The components are always orthogonal to one another.': a,
    'Eigenvalues tell us the amount of information a component holds': a
}
In [ ]:
# Check your solution against ours by running this cell
t.check_question_two(solution_2_dict)
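
As a quick sanity check on the first two statements, you can also verify the properties numerically. The sketch below fits PCA directly with scikit-learn rather than the do_pca helper, and it assumes df is all-numeric as the helpers expect; the select_dtypes call guards against any stray non-numeric column.

In [ ]:
# Sketch: verify the PCA properties numerically on the standardized data.
X_std = StandardScaler().fit_transform(df.select_dtypes(include='number'))
check_pca = PCA().fit(X_std)

# Components are orthonormal, so C @ C.T should be the identity matrix.
C = check_pca.components_
print(np.allclose(C @ C.T, np.eye(C.shape[0])))

# The eigenvalues (explained_variance_) measure how much variability each
# component holds, and they come back sorted in decreasing order.
print(check_pca.explained_variance_)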

3. Fit PCA to reduce the dimensionality of the dataset to 3 dimensions. You can use the helper functions, or perform the steps on your own. If you fit on your own, be sure to standardize your data. At the end of this process, you will want an X matrix whose dimensionality has been reduced to only 3 features, along with the pca object that was used to fit and transform your dataset.

In [ ]:
pca, X_pca = do_pca(3, df)
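
If you would rather not use the do_pca helper, a minimal sketch of the same steps (standardize, then fit and transform with three components) is below; like do_pca, it assumes the features are numeric.

In [ ]:
# Sketch: the same reduction without the helper -- standardize, then fit PCA.
X_std = StandardScaler().fit_transform(df.select_dtypes(include='number'))
pca_manual = PCA(n_components=3)
X_pca_manual = pca_manual.fit_transform(X_std)
print(X_pca_manual.shape)  # (number of cars, 3)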

4. Once you have your pca object, you can take a closer look at what comprises each of the principal components. Use the pca_results function from the helper_functions module to examine the results of your analysis. The function takes two arguments: the full dataset and the pca object you created.

In [ ]:
pca_results(df, pca)
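
If the helper is unavailable, the same information lives on the fitted pca object itself. The sketch below builds a small table of component weights and explained variance ratios; it assumes every numeric column of df was used as a feature, so the column labels line up with the component weights.

In [ ]:
# Sketch: component weights and explained variance without the helper.
feature_names = df.select_dtypes(include='number').columns
weights = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=['Dimension {}'.format(i + 1) for i in range(len(pca.components_))],
)
weights.insert(0, 'Explained Variance', np.round(pca.explained_variance_ratio_, 4))
weights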

5. Use the results to match each of the variables as the value to the most appropriate key in the dictionary below. When you are ready to check your answers, run the following cell to see if your solution matches ours!

In [ ]:
a = 'car weight'
b = 'sports cars'
c = 'gas mileage'
d = 0.4352
e = 0.3061
f = 0.1667
g = 0.7053

solution_5_dict = {
    'The first component positively weights items related to': c, 
    'The amount of variability explained by the first component is': d,
    'The largest weight of the second component is related to': b,
    'The total amount of variability explained by the first three components': g
}
In [ ]:
# Run this cell to check if your solution matches ours.
t.check_question_five(solution_5_dict)

6. How many components need to be kept to explain at least 85% of the variability in the original dataset? When you think you have the answer, store it in the variable num_comps. Then run the following cell to see if your solution matches ours!

In [ ]:
# Refit PCA with an increasing number of components until the retained
# components explain more than 85% of the variability in the data.
for comp in range(3, df.shape[1]):
    pca, X_pca = do_pca(comp, df)
    comp_check = pca_results(df, pca)
    if comp_check['Explained Variance'].sum() > 0.85:
        break

num_comps = comp_check.shape[0]
print("Using {} components, we can explain {:.2f}% of the variability in the original data.".format(
    num_comps, comp_check['Explained Variance'].sum() * 100))
In [ ]:
# Now check your answer here to complete this mini project
display.HTML(t.question_check_six(num_comps))