Your Turn
In the previous video, you saw an example of working with some MNIST digits data. A link to the dataset can be found here: http://yann.lecun.com/exdb/mnist/.
First, let's import the necessary libraries. Notice that there are also some imports from a file called helper_functions, which contains the functions used in the previous video.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from helper_functions import show_images, show_images_by_digit, fit_random_forest_classifier2
from helper_functions import fit_random_forest_classifier, do_pca, plot_components
import test_code as t
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
1. Use pandas to read in the dataset, which can be found at './data/train.csv'. Take a look at the data using head, tail, describe, info, etc. You can learn more about the pixel values from the article here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm.
train = pd.read_csv('./data/train.csv')  # one row per image: a label column plus the pixel columns
train.fillna(0, inplace=True)            # replace any missing pixel values with 0
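For the exploration itself, a minimal sketch (in a notebook, each call displays its output; the exact column names come from the CSV):

train.head()      # first few rows: the label column plus one column per pixel
train.describe()  # summary statistics for every column
train.info()      # dtypes, non-null counts, and memory usage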
2. Create a vector called y that holds the label column of the dataset. Store all other columns holding the pixel data of your images in X.
# Save the label column to a pandas Series
y = train['label']
# Drop the label column, keeping only the pixel features
X = train.drop('label', axis=1)
# Check your solution
t.question_two_check(y, X)
That looks right!
3. Now use the show_images_by_digit function from the helper_functions module to take a look at some of the 1's, 2's, 3's, or any other value you are interested in. Do they all look like what you would expect?
show_images_by_digit(2) # Try looking at a few other digits
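If you are curious how a helper like show_images_by_digit might work, here is a rough sketch of the idea (a hypothetical show_digits, not the helper_functions implementation): each row of X holds 784 pixel values, which reshape to a 28x28 image.

# Hypothetical sketch of a show_images_by_digit-style helper
def show_digits(digit, n_images=20):
    images = X[y == digit].values[:n_images]  # first few rows with this label
    fig, axes = plt.subplots(4, 5, figsize=(10, 8))
    for ax, pixels in zip(axes.ravel(), images):
        ax.imshow(pixels.reshape(28, 28), cmap=plt.cm.gray_r)  # 784 pixels -> 28x28 image
        ax.axis('off')
    plt.show()

show_digits(2)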
4. Now that you have had a chance to look through some of the data, you can try different algorithms to see what works well for predicting the response from the X matrix. If you would like to use the random forest function from the video, you can run the code below, but you might also try any of the supervised techniques you learned in the previous course to see what works best.
If you decide to put together your own classifier, remember the 4 steps to this process:
I. Instantiate your model. (with all the hyperparameter values you care about)
II. Fit your model. (to the training data)
III. Predict using your fitted model. (on the test data)
IV. Score your model. (comparing the predictions to the actual values on the test data)
You can also try a grid search to see if you can improve on your initial predictions.
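Here is a minimal sketch of those four steps with scikit-learn, using only the imports already loaded above; the hyperparameter values and the 0.33 test split are placeholder choices, not tuned recommendations:

# Hold out a test set so the score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# I. Instantiate the model (placeholder hyperparameters, not tuned)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# II. Fit the model to the training data
clf.fit(X_train, y_train)

# III. Predict using the fitted model on the test data
y_preds = clf.predict(X_test)

# IV. Score the model by comparing predictions to the actual test labels
print(confusion_matrix(y_test, y_preds))
print(accuracy_score(y_test, y_preds))

For the grid search, sklearn.model_selection.GridSearchCV can wrap step I, searching over values such as n_estimators and max_depth.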
fit_random_forest_classifier(X, y)
[[1317    0    2    0    2    1    5    0    6    0]
 [   1 1501    5    3    2    2    3    1    1    1]
 [   6    4 1354    8   13    1    7   10   10    1]
 [   6    3   13 1378    3   22    2   16   18   10]
 [   4    0    1    0 1311    0    7    3    3   29]
 [   6    2    2   18    2 1149   10    1    8    7]
 [   9    3    1    0    4    8 1366    0    6    0]
 [   1    8   23    2   10    1    0 1405    4   26]
 [   3    5    6   11    7    6    5    3 1273   15]
 [   6    4    4   20   11    4    2    9   13 1275]]
0.961688311688
5. Now for the purpose of this lesson: PCA. In the video, I created a model using just two features. Replicate that process below. You can use the same do_pca function that was created in the previous video. Store your variables in pca and X_pca.
pca, X_pca = do_pca(2, X)  # performs PCA to create two components
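For reference, a do_pca-style helper generally standardizes the features and then fits PCA. Below is a hypothetical sketch of that pattern (named do_pca_sketch to make clear it is not necessarily the helper_functions implementation):

# Hypothetical sketch of a do_pca-style helper (may differ from helper_functions)
def do_pca_sketch(n_components, data):
    X_std = StandardScaler().fit_transform(data)  # PCA is sensitive to feature scale
    pca = PCA(n_components)
    X_pca = pca.fit_transform(X_std)              # project onto the top components
    return pca, X_pca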
6. X_pca has reduced the original set of more than 700 features down to only 2 features that capture the majority of the variability in the pixel values. Use the space below to fit a model using these two features to predict the handwritten digit. You can use the random forest model by running fit_random_forest_classifier the same way as in the video. How well does it perform?
fit_random_forest_classifier(X_pca, y)
[[ 828    0  181   49   32   44  139    8   40   12]
 [   1 1291    2   16   28   34   19   44   38   47]
 [ 228    5  305  194  130  119  204   52  121   56]
 [  79   26  207  222  149  178  172  143  174  121]
 [  51   60  120  163  226  111  123  171  129  204]
 [  64   22  133  131  135  228  191   48  194   59]
 [ 203   28  218  159  112  203  279   33  129   33]
 [   9   81   60  112  177   74   30  527   78  332]
 [  57   51  119  158  138  224  154   86  230  117]
 [  15  110   61   96  206   62   50  312   71  365]]
0.324747474747
7. Now you can look at the separation of the values using the plot_components function. If you plot all of the points (more than 40,000), you likely will not be able to see much of what is happening. I recommend plotting just a subset of the data. Which value(s) show some separation and are predicted better than others based on these two components?
plot_components(X_pca[:100], y[:100])
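plot_components is essentially a labeled scatter plot of the two components. A minimal sketch of the idea (a hypothetical plot_components_sketch, not the actual helper) might look like this:

# Hypothetical sketch of a plot_components-style plot (not the actual helper)
def plot_components_sketch(X_2d, labels):
    plt.figure(figsize=(8, 6))
    for point, label in zip(X_2d, labels):
        plt.text(point[0], point[1], str(label), fontsize=9)  # draw each point as its digit
    plt.xlim(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1)
    plt.ylim(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1)
    plt.xlabel('First component')
    plt.ylabel('Second component')
    plt.show()

plot_components_sketch(X_pca[:100], y[:100])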
8. See if you can find a reduced number of features that provides better separation for making predictions. Say you want enough separation for an accuracy of more than 90%: how many principal components are needed to reach that level? Were you able to substantially reduce the number of features needed in your final model?
for comp in range(2, 100):
    pca, X_pca = do_pca(comp, X)
    acc = fit_random_forest_classifier(X_pca, y)
    if acc > .90:
        print("With only {} components, a random forest achieved an accuracy of {}.".format(comp, acc))
        break
(One confusion matrix is printed per iteration; the matrices are omitted here for brevity. The accuracy at each component count was:)
 2 components: 0.325036075036
 3 components: 0.528066378066
 4 components: 0.677056277056
 5 components: 0.749855699856
 6 components: 0.825685425685
 7 components: 0.837950937951
 8 components: 0.862337662338
 9 components: 0.875757575758
10 components: 0.897041847042
11 components: 0.903463203463
With only 11 components, a random forest achieved an accuracy of 0.9034632034632034.
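A cheaper way to gauge how many components you might need, without refitting a classifier each time, is PCA's explained_variance_ratio_ attribute; for example, using the pca object left over from the last loop iteration above:

print(pca.explained_variance_ratio_)             # share of pixel variance per component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share across components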
9. It is possible that the extra features in the dataset actually lead to overfitting or the curse of dimensionality. Do you have evidence of this happening for this dataset? Can you support your evidence with a visual or a table? To avoid printing out all of the metric results, I created another function called fit_random_forest_classifier2. I ran through a significant number of components to create the visual for the solution, but I strongly recommend you stay in the range below 100 principal components!
# I would highly recommend not running the code below, as it had to run overnight to complete.
# Instead, you can run a smaller number of components that still lets you see enough.
#accs = []
#comps = []
#for comp in range(2, 700):
#    comps.append(comp)
#    pca, X_pca = do_pca(comp, X)
#    acc = fit_random_forest_classifier2(X_pca, y)
#    accs.append(acc)
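# An assumed, much smaller sweep you could run instead: stepping by 5 keeps the
# runtime manageable, still shows the shape of the curve, and defines the comps
# and accs used by the plot below.
comps = []
accs = []
for comp in range(2, 100, 5):
    comps.append(comp)
    pca, X_pca = do_pca(comp, X)
    accs.append(fit_random_forest_classifier2(X_pca, y))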
plt.plot(comps, accs, 'bo');
plt.xlabel('Number of Components');
plt.ylabel('Accuracy');
plt.title('Number of Components by Accuracy');
# The max accuracy and corresponding number of components
np.max(accs), comps[np.argmax(accs)]
(0.94126984126984126, 61)
Here you can see that the accuracy quickly levels off. The maximum accuracy is actually achieved at 61 principal components. The slight negative trend beyond that point also indicates that the final components mostly contain noise. The 61 components contain the information needed to identify the digits nearly as well as using the entire image. Next, let's take a closer look at exactly what other information we get from PCA, and how we can interpret it.