Your Turn
In the previous video, you saw an example of working with some MNIST digits data. A link to the dataset can be found here: http://yann.lecun.com/exdb/mnist/.
First, let's import the necessary libraries. Notice that there are also some imports from a file called helper_functions, which contains the functions used in the previous video.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from helper_functions import show_images, show_images_by_digit, fit_random_forest_classifier2
from helper_functions import fit_random_forest_classifier, do_pca, plot_components
import test_code as t
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
1. Use pandas to read in the dataset, which can be found at './data/train.csv'. Take a look at the data using head, tail, describe, info, etc. You can learn more about the pixel values from the article here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm.
train = pd.read_csv('./data/train.csv')  # one row per image: a label column plus the pixel columns
train.fillna(0, inplace=True)            # replace any missing pixel values with 0
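For the exploration itself, a minimal sketch (in a notebook, each call displays its output; the exact column names come from the CSV):

train.head()      # first few rows: the label column plus one column per pixel
train.describe()  # summary statistics for every column
train.info()      # dtypes, non-null counts, and memory usage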
2. Create a vector called y that holds the label column of the dataset. Store all other columns holding the pixel data of your images in X.
# Save the label column to a pandas Series
y = train['label']
# Drop the label column, keeping only the pixel features
X = train.drop('label', axis=1)
# Check your solution
t.question_two_check(y, X)
That looks right!
3. Now use the show_images_by_digit function from the helper_functions module to take a look at some of the 1's, 2's, 3's, or any other value you are interested in. Do they all look like what you would expect?
show_images_by_digit(2) # Try looking at a few other digits
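If you are curious how a helper like show_images_by_digit might work, here is a rough sketch of the idea (a hypothetical show_digits, not the helper_functions implementation): each row of X holds 784 pixel values, which reshape to a 28x28 image.

# Hypothetical sketch of a show_images_by_digit-style helper
def show_digits(digit, n_images=20):
    images = X[y == digit].values[:n_images]  # first few rows with this label
    fig, axes = plt.subplots(4, 5, figsize=(10, 8))
    for ax, pixels in zip(axes.ravel(), images):
        ax.imshow(pixels.reshape(28, 28), cmap=plt.cm.gray_r)  # 784 pixels -> 28x28 image
        ax.axis('off')
    plt.show()

show_digits(2)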
4. Now that you have had a chance to look through some of the data, you can try different algorithms to see what works well for predicting the response from the X matrix. If you would like to use the random forest function from the video, you can run the code below, but you might also try any of the supervised techniques you learned in the previous course to see what works best.
If you decide to put together your own classifier, remember the 4 steps to this process:
I. Instantiate your model. (with all the hyperparameter values you care about)
II. Fit your model. (to the training data)
III. Predict using your fitted model. (on the test data)
IV. Score your model. (comparing the predictions to the actual values on the test data)
You can also try a grid search to see if you can improve on your initial predictions.
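Here is a minimal sketch of those four steps with scikit-learn, using only the imports already loaded above; the hyperparameter values and the 0.33 test split are placeholder choices, not tuned recommendations:

# Hold out a test set so the score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# I. Instantiate the model (placeholder hyperparameters, not tuned)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# II. Fit the model to the training data
clf.fit(X_train, y_train)

# III. Predict using the fitted model on the test data
y_preds = clf.predict(X_test)

# IV. Score the model by comparing predictions to the actual test labels
print(confusion_matrix(y_test, y_preds))
print(accuracy_score(y_test, y_preds))

For the grid search, sklearn.model_selection.GridSearchCV can wrap step I, searching over values such as n_estimators and max_depth.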
fit_random_forest_classifier(X, y)
[[1317    0    2    0    2    1    5    0    6    0]
 [   1 1501    5    3    2    2    3    1    1    1]
 [   6    4 1354    8   13    1    7   10   10    1]
 [   6    3   13 1378    3   22    2   16   18   10]
 [   4    0    1    0 1311    0    7    3    3   29]
 [   6    2    2   18    2 1149   10    1    8    7]
 [   9    3    1    0    4    8 1366    0    6    0]
 [   1    8   23    2   10    1    0 1405    4   26]
 [   3    5    6   11    7    6    5    3 1273   15]
 [   6    4    4   20   11    4    2    9   13 1275]]
0.961688311688
5. Now for the purpose of this lesson: PCA. In the video, I created a model using just two features. Replicate that process below. You can use the same do_pca function that was created in the previous video. Store your variables in pca and X_pca.
pca, X_pca = do_pca(2, X)  # performs PCA to create two components
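For reference, a do_pca-style helper generally standardizes the features and then fits PCA. Below is a hypothetical sketch of that pattern (named do_pca_sketch to make clear it is not necessarily the helper_functions implementation):

# Hypothetical sketch of a do_pca-style helper (may differ from helper_functions)
def do_pca_sketch(n_components, data):
    X_std = StandardScaler().fit_transform(data)  # PCA is sensitive to feature scale
    pca = PCA(n_components)
    X_pca = pca.fit_transform(X_std)              # project onto the top components
    return pca, X_pca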
6. X_pca has reduced the original set of more than 700 features down to only 2 features that capture the majority of the variability in the pixel values. Use the space below to fit a model using these two features to predict the handwritten digit. You can use the random forest model by running fit_random_forest_classifier the same way as in the video. How well does it perform?
fit_random_forest_classifier(X_pca, y)
[[ 828    0  181   49   32   44  139    8   40   12]
 [   1 1291    2   16   28   34   19   44   38   47]
 [ 228    5  305  194  130  119  204   52  121   56]
 [  79   26  207  222  149  178  172  143  174  121]
 [  51   60  120  163  226  111  123  171  129  204]
 [  64   22  133  131  135  228  191   48  194   59]
 [ 203   28  218  159  112  203  279   33  129   33]
 [   9   81   60  112  177   74   30  527   78  332]
 [  57   51  119  158  138  224  154   86  230  117]
 [  15  110   61   96  206   62   50  312   71  365]]
0.324747474747
7. Now you can look at the separation of the values using the plot_components function. If you plot all of the points (more than 40,000), you likely will not be able to see much of what is happening. I recommend plotting just a subset of the data. Which value(s) show some separation and are predicted better than others based on these two components?
plot_components(X_pca[:100], y[:100])
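plot_components is essentially a labeled scatter plot of the two components. A minimal sketch of the idea (a hypothetical plot_components_sketch, not the actual helper) might look like this:

# Hypothetical sketch of a plot_components-style plot (not the actual helper)
def plot_components_sketch(X_2d, labels):
    plt.figure(figsize=(8, 6))
    for point, label in zip(X_2d, labels):
        plt.text(point[0], point[1], str(label), fontsize=9)  # draw each point as its digit
    plt.xlim(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1)
    plt.ylim(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1)
    plt.xlabel('First component')
    plt.ylabel('Second component')
    plt.show()

plot_components_sketch(X_pca[:100], y[:100])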
8. See if you can find a reduced number of features that provides better separation for making predictions. Say you want enough separation for an accuracy of more than 90%: how many principal components are needed to reach that level? Were you able to substantially reduce the number of features needed in your final model?
for comp in range(2, 100):
    pca, X_pca = do_pca(comp, X)
    acc = fit_random_forest_classifier(X_pca, y)
    if acc > .90:
        print("With only {} components, a random forest achieved an accuracy of {}.".format(comp, acc))
        break
(One confusion matrix is printed per iteration; the matrices are omitted here for brevity. The accuracy at each component count was:)
 2 components: 0.325036075036
 3 components: 0.528066378066
 4 components: 0.677056277056
 5 components: 0.749855699856
 6 components: 0.825685425685
 7 components: 0.837950937951
 8 components: 0.862337662338
 9 components: 0.875757575758
10 components: 0.897041847042
11 components: 0.903463203463
With only 11 components, a random forest achieved an accuracy of 0.9034632034632034.
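A cheaper way to gauge how many components you might need, without refitting a classifier each time, is PCA's explained_variance_ratio_ attribute; for example, using the pca object left over from the last loop iteration above:

print(pca.explained_variance_ratio_)             # share of pixel variance per component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share across components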
9. It is possible that the extra features in the dataset actually lead to overfitting or the curse of dimensionality. Do you have evidence of this happening for this dataset? Can you support your evidence with a visual or a table? To avoid printing out all of the metric results, I created another function called fit_random_forest_classifier2. I ran through a significant number of components to create the visual for the solution, but I strongly recommend you stay in the range below 100 principal components!
# I would highly recommend not running the code below, as it had to run overnight to complete.
# Instead, you can run a smaller number of components that still lets you see enough.
#accs = []
#comps = []
#for comp in range(2, 700):
#    comps.append(comp)
#    pca, X_pca = do_pca(comp, X)
#    acc = fit_random_forest_classifier2(X_pca, y)
#    accs.append(acc)
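# An assumed, much smaller sweep you could run instead: stepping by 5 keeps the
# runtime manageable, still shows the shape of the curve, and defines the comps
# and accs used by the plot below.
comps = []
accs = []
for comp in range(2, 100, 5):
    comps.append(comp)
    pca, X_pca = do_pca(comp, X)
    accs.append(fit_random_forest_classifier2(X_pca, y))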
plt.plot(comps, accs, 'bo');
plt.xlabel('Number of Components');
plt.ylabel('Accuracy');
plt.title('Number of Components by Accuracy');
# The max accuracy and corresponding number of components
np.max(accs), comps[np.argmax(accs)]
(0.94126984126984126, 61)
Here you can see that the accuracy quickly levels off. The maximum accuracy is actually achieved at 61 principal components. The slight negative trend beyond that point also indicates that the final components mostly contain noise. The 61 components contain the information needed to identify the digits nearly as well as using the entire image. Next, let's take a closer look at exactly what other information we get from PCA, and how we can interpret it.