PCA Mini Project - Solution¶
In the lesson, you saw how you could use PCA to substantially reduce the dimensionality of the handwritten digits. In this mini-project, you will be using the cars.csv file.
To begin, run the cell below to read in the necessary libraries and the dataset. I also read in the helper functions that you used throughout the lesson in case you might find them helpful in completing this project. Otherwise, you can always create functions of your own!
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from helper_functions import do_pca, scree_plot, plot_components, pca_results
from IPython import display
import test_code2 as t
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('./data/cars.csv')
1. Now your data is stored in df. Use the below cells to take a look your dataset. At the end of your exploration, use your findings to match the appropriate variable to each key in the dictionary below.
df.head()
df.describe()
df.shape
a = 7
b = 66
c = 387
d = 18
e = 0.23
f = 0.05
solution_1_dict = {
'The number of cars in the dataset': c,
'The number of car features in the dataset': d,
'The number of dummy variables in the dataset': a,
'The proportion of minivans in the dataset': f,
'The max highway mpg for any car': b
}
# Check your solution against ours by running this cell
display.HTML(t.check_question_one(solution_1_dict))
2. There are some particularly nice properties about PCA to keep in mind. Use the dictionary below to match the correct variable as the key to each statement. When you are ready, check your solution against ours by running the following cell.
a = True
b = False
solution_2_dict = {
'The components span the directions of maximum variability.': a,
'The components are always orthogonal to one another.': a,
'Eigenvalues tell us the amount of information a component holds': a
}
# Check your solution against ours by running this cell
t.check_question_two(solution_2_dict)
3. Fit PCA to reduce the current dimensionality of the datset to 3 dimensions. You can use the helper functions, or perform the steps on your own. If you fit on your own, be sure to standardize your data. At the end of this process, you will want an X matrix with the reduced dimensionality to only 3 features. Additionally, you will want your pca object back that has been used to fit and transform your dataset.
pca, X_pca = do_pca(3, df)
4. Once you have your pca object, you can take a closer look at what comprises each of the principal components. Use the pca_results function from the helper_functions module assist with taking a closer look at the results of your analysis. The function takes two arguments: the full dataset and the pca object you created.
pca_results(df, pca)
5. Use the results, to match each of the variables as the value to the most appropriate key in the dictionary below. When you are ready to check your answers, run the following cell to see if your solution matches ours!
a = 'car weight'
b = 'sports cars'
c = 'gas mileage'
d = 0.4352
e = 0.3061
f = 0.1667
g = 0.7053
solution_5_dict = {
'The first component positively weights items related to': c,
'The amount of variability explained by the first component is': d,
'The largest weight of the second component is related to': b,
'The total amount of variability explained by the first three components': g
}
# Run this cell to check if your solution matches ours.
t.check_question_five(solution_5_dict)
6. How many components need to be kept to explain at least 85% of the variability in the original dataset? When you think you have the answer, store it in the variable num_comps. Then run the following cell to see if your solution matches ours!
for comp in range(3, df.shape[1]):
pca, X_pca = do_pca(comp, df)
comp_check = pca_results(df, pca)
if comp_check['Explained Variance'].sum() > 0.85:
break
num_comps = comp_check.shape[0]
print("Using {} components, we can explain {}% of the variability in the original data.".format(comp_check.shape[0],comp_check['Explained Variance'].sum()))
# How check your answer here to complete this mini project
display.HTML(t.question_check_six(num_comps))