Lab: Random Projection¶
In the previous video, you saw an example of working with some MNIST digits data. A link to the dataset can be found here: http://yann.lecun.com/exdb/mnist/.
First, let's import the necessary libraries. Notice that there are also some imports from a file called helper_functions, which contains the functions used in the previous video.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from helper_functions import fit_random_forest_classifier, plot_rp, show_images_by_digit
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
I. Use random projection to reduce dimensions¶
In this first section, we will use sparse random projection to transform the data from a high-dimensional space to a lower-dimensional one and compare the classification accuracy before and after the transformation.
1. Use pandas to read in the dataset, which can be found at './data/train.csv'. Take a look at the data using head, tail, describe, info, etc. You can learn more about the pixel values from the article here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm.
train = pd.read_csv('./data/train.csv')
train.fillna(0, inplace=True)
2. Create a vector called y that holds the label column of the dataset. Store all other columns holding the pixel data of your images in X.
# save the labels to a Pandas series target
y = train['label']
# Drop the label feature
X = train.drop("label",axis=1)
3. Now use the show_images_by_digit function from the helper_functions module to take a look at some of the 1's, 2's, 3's, or any other value you are interested in looking at. Do they all look like what you would expect?
show_images_by_digit(2) # Try looking at a few other digits
4. Now that you have had a chance to look through some of the data, you can try different algorithms to see how well the X matrix predicts the response. If you would like to use the imported random forest classifier, you can run the code below, but you might also try any of the supervised techniques you learned in the previous course to see what works best.
If you decide to put together your own classifier, remember the 4 steps to this process:
I. Instantiate your model. (with all the hyperparameter values you care about)
II. Fit your model. (to the training data)
III. Predict using your fitted model. (on the test data)
IV. Score your model. (comparing the predictions to the actual values on the test data)
You can also try a grid search to see if you can improve on your initial predictions.
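If you want to put one together yourself, here is a minimal sketch of those four steps using the RandomForestClassifier and train_test_split imports above; the hyperparameter values and test-split size are just placeholders, not the settings used by the helper function.

# Minimal sketch of the four-step process (hyperparameter values are placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# I. Instantiate the model with the hyperparameter values you care about
clf = RandomForestClassifier(n_estimators=100)

# II. Fit the model to the training data
clf.fit(X_train, y_train)

# III. Predict using the fitted model on the test data
y_preds = clf.predict(X_test)

# IV. Score the model by comparing predictions to the actual test values
print(confusion_matrix(y_test, y_preds))
print('accuracy:', accuracy_score(y_test, y_preds))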
fit_random_forest_classifier(X, y)
[[202   0   1   0   0   0   6   0   0   0]
 [  0 236   2   0   0   1   1   2   1   0]
 [  0   4 215   1   2   0   1   5   0   0]
 [  3   0   5 174   0   5   0   1   1   2]
 [  1   0   2   0 169   0   2   0   1   3]
 [  3   1   0   4   0 173   3   0   2   0]
 [  1   0   1   0   0   1 206   0   2   0]
 [  0   1   7   0   5   0   0 206   2   4]
 [  1   1   0   5   2   2   0   0 188   3]
 [  2   1   1   2   9   2   0   2   4 185]]
accuracy: 0.9389716482460355
5. Now for the purpose of this lesson: random projection. In the cell below, reduce the dimension of the training data X using SparseRandomProjection from sklearn's random_projection module. To start, use 0.5 as the value of epsilon. Store your variables in rp and X_rp.
# TODO: perform random projection to reduce the dataset dimension.
# To start, use 0.5 as the epsilon value.
from sklearn.random_projection import SparseRandomProjection
rp = SparseRandomProjection(eps = 0.5)
X_rp = rp.fit_transform(X)
To look at the dimension of the transformed data, you can use the n_components_ attribute.
rp_dim = rp.n_components_
X_dim = X.shape[1]
print("The orignial data has {} dimensions and it is reduced to {} after random projection.".format(X_dim, rp_dim))
The orignial data has 784 dimensions and it is reduced to 419 after random projection.
6. X_rp has removed roughly half of the original features. Use the space below to fit a model on this reduced data to predict the written digit. You can use the random forest model by running fit_random_forest_classifier the same way as in the video. How well does it perform?
# TODO: fit the new data with a classifier
fit_random_forest_classifier(X_rp, y)
[[200   0   1   2   0   0   5   0   1   0]
 [  0 234   2   1   0   0   1   2   3   0]
 [  1   4 210   2   4   1   1   3   2   0]
 [  3   0   7 167   0   5   1   1   5   2]
 [  0   0   4   0 166   0   2   0   1   5]
 [  1   0   0   8   3 168   3   1   0   2]
 [  3   0   2   0   2   5 198   0   0   1]
 [  1   2   4   0   7   0   0 204   3   4]
 [  2   3   3   7   2   4   2   0 174   5]
 [  4   0   2   2  10   3   0  10   4 173]]
accuracy: 0.9101393560788082
7. Epsilon is the error tolerance that determines how much the transformed data is allowed to be distorted relative to the original data. Now, see if you can change epsilon to reduce the dimension even further. What is the accuracy of the resulting classifier?
# TODO: write a loop to transform X using different epsilon
for sample_eps in np.arange(0.5, 1, 0.2):
    rp = SparseRandomProjection(eps=sample_eps)
    X_rp = rp.fit_transform(X)
    print(f"With epsilon = {sample_eps:.2f}, the transformed data has {X_rp.shape[1]} components. The random forest achieved an accuracy of:")
    fit_random_forest_classifier(X_rp, y)
    print("------------------------------")
With epsilon = 0.50, the transformed data has 419 components. The random forest achieved an accuracy of:
[[195   0   2   0   1   0   8   0   3   0]
 [  0 235   2   0   0   2   1   1   2   0]
 [  2   2 206   2   3   0   2   8   2   1]
 [  2   0   9 167   0   7   1   1   2   2]
 [  0   0   4   0 164   0   2   1   0   7]
 [  2   1   1  11   3 163   2   0   3   0]
 [  3   0   5   0   0   2 201   0   0   0]
 [  1   1   7   2   4   0   0 205   1   4]
 [  1   1   1  12   1  11   0   0 173   2]
 [  3   2   0   2  14   0   1  11   3 172]]
accuracy: 0.9038923594425757
------------------------------
With epsilon = 0.70, the transformed data has 267 components. The random forest achieved an accuracy of:
[[192   0   3   0   1   2   8   0   2   1]
 [  0 233   3   1   0   0   2   1   3   0]
 [  4   5 208   0   2   0   3   5   1   0]
 [  2   0   8 166   0   9   0   0   2   4]
 [  1   0   1   0 161   0   4   1   0  10]
 [  2   1   2   8   2 164   4   1   2   0]
 [  3   0   2   0   3   2 200   1   0   0]
 [  0   3   5   2   4   0   1 201   2   7]
 [  1   1   2  14   2   5   2   0 173   2]
 [  2   0   1   2  13   1   0   8   5 176]]
accuracy: 0.9005285920230658
------------------------------
With epsilon = 0.90, the transformed data has 216 components. The random forest achieved an accuracy of:
[[197   0   2   0   1   2   7   0   0   0]
 [  0 235   2   0   0   1   1   1   2   1]
 [  1   4 208   2   1   2   1   6   2   1]
 [  8   0   9 160   0   7   1   1   5   0]
 [  0   0   3   0 165   0   2   0   1   7]
 [  3   2   1   4   1 171   2   0   2   0]
 [  2   0   2   0   0   2 204   1   0   0]
 [  0   0   7   0   2   0   1 207   2   6]
 [  2   1   2  11   2   8   2   1 171   2]
 [  1   0   0   2  17   0   0   9   6 173]]
accuracy: 0.9086977414704469
------------------------------
8. As you can see, the accuracy is still quite high after removing more than half of the columns, and a higher value of epsilon gives you a dataset with a smaller number of components. Epsilon is an important parameter in the Johnson-Lindenstrauss lemma. Let's see how epsilon changes the number of components in the projection.
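For reference, johnson_lindenstrauss_min_dim implements the conservative bound from the lemma; per the scikit-learn documentation, the minimum safe number of components for a given number of samples and distortion epsilon is

$$n\_components \ge \frac{4 \,\ln(n\_samples)}{\epsilon^{2}/2 - \epsilon^{3}/3}$$

so allowing less distortion (smaller epsilon) requires many more components, which is what the plot below shows.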
# Calculate the number of components with varying eps
from sklearn.random_projection import johnson_lindenstrauss_min_dim
eps = np.arange(0.1, 1, 0.01)
n_comp = johnson_lindenstrauss_min_dim(n_samples=1e6, eps=eps)
plt.plot(eps, n_comp, 'bo');
plt.xlabel('eps');
plt.ylabel('Number of Components');
plt.title('Number of Components by eps');
II. Visualize the random projection¶
In this section, we will take a look at how the projection works.
1. Find the number of samples and the number of components (features) in the original data X. Store them in X_sample and X_comp.
# TODO: find the number of samples and components in X
X_sample, X_comp = X.shape
print("The orignial data has {} samples with dimension {}.".format(X_sample, X_comp))
The orignial data has 6304 samples with dimension 784.
2. Project the data onto a space with n_components dimensions using SparseRandomProjection. Sometimes you may want to define the dimension of the projected space directly rather than deriving it from epsilon.
In random projection, besides using a predefined epsilon to determine the dimension, you can also use the n_components parameter to set the number of components in the transformed data directly.
# TODO: define an n_components value and perform random projection on data X.
# Store the transformed data in a `X_rp` variable
n_components = 30
rp = SparseRandomProjection(n_components=n_components)
X_rp = rp.fit_transform(X)
3. Run the following helper function to plot two figures.
The first figure shows the distribution of pairwise distances in the original data plotted against those in the projected data.
The second figure shows the ratio of pairwise distances in the projected data to those in the original data (projected distance / original distance).
It may take about 2-5 minutes to plot the figures due to the large data dimension.
plot_rp(X, X_rp, n_components)
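plot_rp lives in helper_functions, so its exact implementation isn't shown here. The sketch below is a rough, hypothetical version of the kind of computation behind such figures, comparing pairwise distances before and after projection on a small subsample; the subsample size of 300 is an arbitrary choice to keep it fast.

# Hypothetical sketch of what a plot like plot_rp computes (not the actual helper):
# compare pairwise distances in the original and projected data on a subsample.
from sklearn.metrics import pairwise_distances

sample_idx = np.random.choice(X.shape[0], size=300, replace=False)
dist_orig = pairwise_distances(X.iloc[sample_idx]).ravel()
dist_proj = pairwise_distances(X_rp[sample_idx]).ravel()

# Drop each point's zero distance to itself before taking ratios
nonzero = dist_orig != 0
dist_orig, dist_proj = dist_orig[nonzero], dist_proj[nonzero]

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(dist_orig, dist_proj, s=1, alpha=0.1)
plt.xlabel('Original pairwise distance')
plt.ylabel('Projected pairwise distance')

plt.subplot(1, 2, 2)
plt.hist(dist_proj / dist_orig, bins=50)
plt.xlabel('Projected distance / original distance')
plt.ylabel('Count');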
4. Change n_components to 30, 100, and 700. What do you observe?
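One way to generate all three sets of figures, assuming plot_rp is called with the same signature as above:

# Re-project and re-plot for each choice of n_components
for n_components in [30, 100, 700]:
    rp = SparseRandomProjection(n_components=n_components)
    X_rp = rp.fit_transform(X)
    plot_rp(X, X_rp, n_components)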
Answer: from the 6 figures produced with the different n_components values, we can see that when the projected space has a relatively low dimension (n_components=30), the space is quite distorted: the 2D pairwise distance distribution forms a circular shape and the distance ratios are widely spread. When the projected space is high-dimensional, or even close to the original dimension (n_components=700), the pairwise distances are very well preserved: the 2D pairwise distance distribution forms a straight line and most of the distance ratios are close to 1.