
Optional Lab: Back propagation using a computation graph

Working through this lab will give you insight into a key algorithm used by most machine learning frameworks. Gradient descent requires the derivative of the cost with respect to each parameter in the network. Neural networks can have millions or even billions of parameters. The back propagation algorithm is used to compute those derivatives. Computation graphs are used to simplify the operation. Let's dig into this below.

In [1]:
from sympy import *
import numpy as np
import re
%matplotlib widget
import matplotlib.pyplot as plt
from matplotlib.widgets import TextBox
from matplotlib.widgets import Button
import ipywidgets as widgets
from lab_utils_backprop import *

Computation Graph

A computation graph simplifies the computation of complex derivatives by breaking them into smaller steps. Let's see how this works.

Let's calculate the derivative of this slightly complex expression, $J = (2+3w)^2$. We would like to find the derivative of $J$ with respect to $w$ or $\frac{\partial J}{\partial w}$.

In [2]:
plt.close("all")
plt_network(config_nw0, "./images/C2_W2_BP_network0.PNG")
plt.close("all") plt_network(config_nw0, "./images/C2_W2_BP_network0.PNG")
Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …
Out[2]:
<lab_utils_backprop.plt_network at 0x72988c8098d0>

Above, you can see we broke the expression into two nodes that we can work on independently. If you already have a good understanding of the process from the lecture, you can go ahead and fill in the boxes in the diagram above. Fill in the blue boxes first, working left to right, and then fill in the green boxes starting on the right and moving to the left. Correct values display in green or blue; incorrect values display in red. Note, the interactive graphic is not particularly robust. If you run into trouble with the interface, run the cell above again to restart.

If you are unsure of the process, we will work this example step by step below.

Forward Propagation

Let's calculate the values in the forward direction.

Just a note about this section. It uses global variables and reuses them as the calculation progresses. If you run cells out of order, you may get funny results. If you do, go back to this point and run them in order.

In [3]:
w = 3
a = 2+3*w
J = a**2
print(f"a = {a}, J = {J}")
a = 11, J = 121

You can fill these values in the blue boxes above.

Backprop

Backprop is the algorithm we use to calculate derivatives. As described in the lectures, backprop starts at the right and moves to the left. The first node to consider is $J = a^2$ and the first step is to find $\frac{\partial J}{\partial a}$.

$\frac{\partial J}{\partial a}$

Arithmetically

Find $\frac{\partial J}{\partial a}$ by finding how $J$ changes as a result of a little change in $a$. This is described in detail in the derivatives optional lab.

In [4]:
a_epsilon = a + 0.001       # a plus a small value, epsilon
J_epsilon = a_epsilon**2    # J computed at a_epsilon
k = (J_epsilon - J)/0.001   # difference divided by epsilon
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_da ~= k = {k} ")
J = 121, J_epsilon = 121.02200099999999, dJ_da ~= k = 22.000999999988835 

$\frac{\partial J}{\partial a}$ is 22 which is $2\times a$. Our result is not exactly $2 \times a$ because our epsilon value is not infinitesimally small.
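The estimate approaches the exact value as epsilon shrinks. A quick numeric sketch (restating a and J from the forward pass above so the snippet stands alone):

a = 11                                      # value of a from the forward pass above
J = a**2                                    # 121
for eps in (0.1, 0.01, 0.001, 0.0001):
    k = ((a + eps)**2 - J) / eps            # finite-difference estimate of dJ/da
    print(f"eps = {eps}, estimate = {k}")   # approaches the exact value 2*a = 22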

Symbolically

Now, let's use SymPy to calculate derivatives symbolically as we did in the derivatives optional lab. We will prefix the name of the variable with an 's' to indicate this is a symbolic variable.

In [5]:
sw,sJ,sa = symbols('w,J,a')
sJ = sa**2
sJ
Out[5]:
$\displaystyle a^{2}$
In [6]:
sJ.subs([(sa,a)])
Out[6]:
$\displaystyle 121$
In [7]:
dJ_da = diff(sJ, sa)
dJ_da
Out[7]:
$\displaystyle 2 a$

So, $\frac{\partial J}{\partial a} = 2a$. When $a=11$, $\frac{\partial J}{\partial a} = 22$. This matches our arithmetic calculation above. If you have not already done so, you can go back to the diagram above and fill in the value for $\frac{\partial J}{\partial a}$.

$\frac{\partial J}{\partial w}$

Moving from right to left, the next value we would like to compute is $\frac{\partial J}{\partial w}$. To do this, we first need to calculate $\frac{\partial a}{\partial w}$ which describes how the output of this node, $a$, changes when the input $w$ changes a little bit.

Arithmetically

Find $\frac{\partial a}{\partial w}$ by finding how $a$ changes as a result of a little change in $w$.

In [8]:
w_epsilon = w + 0.001       # w plus a small value, epsilon
a_epsilon = 2 + 3*w_epsilon
k = (a_epsilon - a)/0.001   # difference divided by epsilon
print(f"a = {a}, a_epsilon = {a_epsilon}, da_dw ~= k = {k} ")
a = 11, a_epsilon = 11.003, da_dw ~= k = 3.0000000000001137 

Calculated arithmetically, $\frac{\partial a}{\partial w} \approx 3$. Let's try it with SymPy.

In [9]:
sa = 2 + 3*sw
sa
Out[9]:
$\displaystyle 3 w + 2$
In [10]:
da_dw = diff(sa,sw)
da_dw
Out[10]:
$\displaystyle 3$

The next step is the interesting part:

  • We know that a small change in $w$ will cause $a$ to change by 3 times that amount.
  • We know that a small change in $a$ will cause $J$ to change by $2\times a$ times that amount. (a=11 in this example)
    so, putting these together,
  • We know that a small change in $w$ will cause $J$ to change by $3 \times 2\times a$ times that amount.

These cascading changes go by the name of the chain rule. It can be written like this: $$\frac{\partial J}{\partial w} = \frac{\partial a}{\partial w} \frac{\partial J}{\partial a} $$

It's worth spending some time thinking this through if it is not clear. This is a key take-away.
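As a cross-check (a sketch; the variable names below are just illustrative), you can substitute $a = 2+3w$ into $J = a^2$ and let SymPy differentiate the composed expression directly. The result agrees with the chain-rule product $\frac{\partial a}{\partial w} \frac{\partial J}{\partial a} = 3 \times 2a$:

from sympy import symbols, diff, simplify
sw = symbols('w')
sa_expr = 2 + 3*sw                  # a written as a function of w
sJ_expr = sa_expr**2                # J composed directly: (2 + 3*w)**2
dJ_dw_direct = diff(sJ_expr, sw)    # differentiate the composed expression
dJ_dw_chain = 3 * (2*sa_expr)       # chain rule: da_dw * dJ_da = 3 * 2a
print(simplify(dJ_dw_direct - dJ_dw_chain) == 0)   # True: the two forms agree
print(dJ_dw_direct.subs(sw, 3))                    # 66 when w = 3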

Let's try calculating it:

In [11]:
dJ_dw = da_dw * dJ_da
dJ_dw
Out[11]:
$\displaystyle 6 a$

And $a$ is 11 in this example so $\frac{\partial J}{\partial w} = 66$. We can check this arithmetically:

In [12]:
w_epsilon = w + 0.001
a_epsilon = 2 + 3*w_epsilon
J_epsilon = a_epsilon**2
k = (J_epsilon - J)/0.001   # difference divided by epsilon
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_dw ~= k = {k} ")
J = 121, J_epsilon = 121.06600900000001, dJ_dw ~= k = 66.0090000000082 

OK! You can now fill the values for $\frac{\partial a}{\partial w}$ and $\frac{\partial J}{\partial w}$ in the diagram if you have not already done so.
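To summarize the first graph, here is the whole forward and backward pass in a few lines of plain Python (a sketch that does not use the lab's plt_network utilities; the values match what you filled into the diagram):

w = 3
# forward pass, left to right
a = 2 + 3*w            # first node
J = a**2               # second node
# backward pass, right to left
dJ_da = 2*a            # local derivative of J = a**2
da_dw = 3              # local derivative of a = 2 + 3w
dJ_dw = da_dw * dJ_da  # chain rule
print(f"a={a}, J={J}, dJ_da={dJ_da}, dJ_dw={dJ_dw}")   # a=11, J=121, dJ_da=22, dJ_dw=66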

Another view
One could visualize these cascading changes this way: a small change in $w$ is multiplied by $\frac{\partial a}{\partial w}$, resulting in a change that is 3 times as large. This larger change is then multiplied by $\frac{\partial J}{\partial a}$, resulting in a change that is now $3 \times 22 = 66$ times larger.
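Tracing an actual small change through both nodes shows the same multiplication (a quick numeric sketch using the values above):

delta_w = 0.001
delta_a = 3 * delta_w       # the a = 2 + 3w node scales the change by 3
delta_J = 22 * delta_a      # the J = a**2 node scales it by dJ/da = 2*a = 22
print(delta_a, delta_J)     # roughly 0.003 and 0.066, i.e. 66 times the change in w
print((2 + 3*(3 + delta_w))**2 - (2 + 3*3)**2)   # direct change in J: about 0.066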

Computation Graph of a Simple Neural Network

Below is a graph of the neural network used in the lecture with different values. Try to fill in the values in the boxes. Note, the interactive graphic is not particularly robust. If you run into trouble with the interface, run the cell below again to restart.

In [13]:
plt.close("all")
plt_network(config_nw1, "./images/C2_W2_BP_network1.PNG")
plt.close("all") plt_network(config_nw1, "./images/C2_W2_BP_network1.PNG")
Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …
Out[13]:
<lab_utils_backprop.plt_network at 0x72988beb7150>

Below, we will go through the computations required to fill in the above computation graph in detail. We start with the forward path.

Forward propagation

The calculations in the forward path are the ones you have recently learned for neural networks. You can compare the values below to those you calculated for the diagram above.

In [14]:
# Inputs and parameters
x = 2
w = -2
b = 8
y = 1
# calculate per step values   
c = w * x
a = c + b
d = a - y
J = d**2/2
print(f"J={J}, d={d}, a={a}, c={c}")
J=4.5, d=3, a=4, c=-4
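Because this section relies on global variables, one convenient pattern is to wrap the forward pass in a small helper so the per-step values can be recomputed at any time. A sketch (the forward_pass name is just illustrative, not part of the lab utilities):

def forward_pass(x, w, b, y):
    # recompute the per-step values of the network above
    c = w * x          # linear term
    a = c + b          # add the bias
    d = a - y          # error
    J = d**2 / 2       # squared-error cost
    return c, a, d, J

c, a, d, J = forward_pass(x=2, w=-2, b=8, y=1)
print(f"J={J}, d={d}, a={a}, c={c}")   # J=4.5, d=3, a=4, c=-4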

Backward propagation (Backprop)

As described in the lectures, backprop starts at the right and moves to the left. The first node to consider is $J = \frac{1}{2}d^2$ and the first step is to find $\frac{\partial J}{\partial d}$.

$\frac{\partial J}{\partial d}$

Arithmetically

Find $\frac{\partial J}{\partial d}$ by finding how $J$ changes as a result of a little change in $d$.

In [15]:
d_epsilon = d + 0.001
J_epsilon = d_epsilon**2/2
k = (J_epsilon - J)/0.001   # difference divided by epsilon
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_dd ~= k = {k} ")
J = 4.5, J_epsilon = 4.5030005, dJ_dd ~= k = 3.0004999999997395 

$\frac{\partial J}{\partial d}$ is 3, which is the value of $d$. Our result is not exactly $d$ because our epsilon value is not infinitesimally small.

Symbolically

Now, let's use SymPy to calculate derivatives symbolically, as we did in the derivatives optional lab. We will prefix the name of the variable with an 's' to indicate this is a symbolic variable.

In [16]:
sx,sw,sb,sy,sJ = symbols('x,w,b,y,J')
sa, sc, sd = symbols('a,c,d')
sJ = sd**2/2
sJ
Out[16]:
$\displaystyle \frac{d^{2}}{2}$
In [17]:
sJ.subs([(sd,d)])
Out[17]:
$\displaystyle \frac{9}{2}$
In [18]:
dJ_dd = diff(sJ, sd)
dJ_dd
Out[18]:
$\displaystyle d$

So, $\frac{\partial J}{\partial d} = d$. When $d=3$, $\frac{\partial J}{\partial d} = 3$. This matches our arithmetic calculation above. If you have not already done so, you can go back to the diagram above and fill in the value for $\frac{\partial J}{\partial d}$.

$\frac{\partial J}{\partial a}$

Moving from right to left, the next value we would like to compute is $\frac{\partial J}{\partial a}$. To do this, we first need to calculate $\frac{\partial d}{\partial a}$ which describes how the output of this node changes when the input $a$ changes a little bit. (Note, we are not interested in how the output changes when $y$ changes since $y$ is not a parameter.)

Arithmetically

Find $\frac{\partial d}{\partial a}$ by finding how $d$ changes as a result of a little change in $a$.

In [19]:
a_epsilon = a + 0.001         # a  plus a small value
d_epsilon = a_epsilon - y
k = (d_epsilon - d)/0.001   # difference divided by epsilon
print(f"d = {d}, d_epsilon = {d_epsilon}, dd_da ~= k = {k} ")
d = 3, d_epsilon = 3.0010000000000003, dd_da ~= k = 1.000000000000334 

Calculated arithmetically, $\frac{\partial d}{\partial a} \approx 1$. Let's try it with SymPy.

Symbolically

In [20]:
sd = sa - sy
sd
Out[20]:
$\displaystyle a - y$
In [21]:
dd_da = diff(sd,sa)
dd_da
Out[21]:
$\displaystyle 1$

Calculated symbolically, $\frac{\partial d}{\partial a}$ also equals 1.

The next step is the interesting part, repeated again in this example:

  • We know that a small change in $a$ will cause $d$ to change by 1 times that amount.
  • We know that a small change in $d$ will cause $J$ to change by $d$ times that amount. (d=3 in this example)
    so, putting these together,
  • We know that a small change in $a$ will cause $J$ to change by $1\times d$ times that amount.

This is again the chain rule. It can be written like this: $$\frac{\partial J}{\partial a} = \frac{\partial d}{\partial a} \frac{\partial J}{\partial d} $$

Let's try calculating it:

In [22]:
dJ_da = dd_da * dJ_dd
dJ_da
Out[22]:
$\displaystyle d$

And $d$ is 3 in this example so $\frac{\partial J}{\partial a} = 3$. We can check this arithmetically:

In [23]:
a_epsilon = a + 0.001
d_epsilon = a_epsilon - y
J_epsilon = d_epsilon**2/2
k = (J_epsilon - J)/0.001   
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_da ~= k = {k} ")
J = 4.5, J_epsilon = 4.503000500000001, dJ_da ~= k = 3.0005000000006277 

OK, they match! You can now fill the values for $\frac{\partial d}{\partial a}$ and $\frac{\partial J}{\partial a}$ in the diagram if you have not already done so.

The steps in backprop
Now that you have worked through several nodes, we can write down the basic method. Working right to left, for each node:

  • calculate the local derivative(s) of the node
  • using the chain rule, combine them with the derivative of the cost with respect to this node's output (the value passed back from the node to the right).

The 'local derivative(s)' are the derivative(s) of the output of the current node with respect to all inputs or parameters.

Let's continue the job. We'll be a bit less verbose now that you are familiar with the method.

$\frac{\partial J}{\partial c}$, $\frac{\partial J}{\partial b}$

The next node has two derivatives of interest. We need to calculate $\frac{\partial J}{\partial c}$ so we can propagate to the left. We also want to calculate $\frac{\partial J}{\partial b}$. Finding the derivative of the cost with respect to the parameters $w$ and $b$ is the goal of backprop. We will find the local derivatives, $\frac{\partial a}{\partial c}$ and $\frac{\partial a}{\partial b}$, first and then combine those with the derivative coming from the right, $\frac{\partial J}{\partial a}$.

In [24]:
# calculate the local derivatives da_dc, da_db
sa = sc + sb
sa
Out[24]:
$\displaystyle b + c$
In [25]:
da_dc = diff(sa,sc)
da_db = diff(sa,sb)
print(da_dc, da_db)
1 1
In [26]:
dJ_dc = da_dc * dJ_da
dJ_db = da_db * dJ_da
print(f"dJ_dc = {dJ_dc},  dJ_db = {dJ_db}")
dJ_dc = d,  dJ_db = d

And in our example $d = 3$, so $\frac{\partial J}{\partial c} = \frac{\partial J}{\partial b} = 3$.
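We will check $\frac{\partial J}{\partial w}$ arithmetically below; as a quick sketch, you can check $\frac{\partial J}{\partial b}$ the same way by nudging $b$ (values restated so the snippet stands alone):

x, w, b, y = 2, -2, 8, 1
J = (w*x + b - y)**2 / 2
J_epsilon = (w*x + (b + 0.001) - y)**2 / 2
k = (J_epsilon - J) / 0.001              # difference divided by epsilon
print(f"J = {J}, dJ_db ~= k = {k}")      # k is close to d = 3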

$\frac{\partial J}{\partial w}$

The last node in this example calculates $c$. Here, we are interested in how $J$ changes with respect to the parameter $w$. We will not back propagate to the input $x$, so we are not interested in $\frac{\partial J}{\partial x}$. Let's start by calculating $\frac{\partial c}{\partial w}$.

In [27]:
# calculate the local derivative
sc = sw * sx
sc
Out[27]:
$\displaystyle w x$
In [28]:
dc_dw = diff(sc,sw)
dc_dw
Out[28]:
$\displaystyle x$

This derivative is a bit more exciting than the last one: it varies with the value of $x$, which is 2 in our example.

Combine this with $\frac{\partial J}{\partial c}$ to find $\frac{\partial J}{\partial w}$.

In [29]:
dJ_dw = dc_dw * dJ_dc
dJ_dw
Out[29]:
$\displaystyle d x$
In [30]:
print(f"dJ_dw = {dJ_dw.subs([(sd,d),(sx,x)])}")
print(f"dJ_dw = {dJ_dw.subs([(sd,d),(sx,x)])}")
dJ_dw = 2*d

$d=3$, so $\frac{\partial J}{\partial w} = 6$ for our example. (The substitution above leaves $d$ in place because the Python name sd was reassigned to the expression $a - y$ earlier, so it no longer refers to the symbol $d$ that appears in dJ_dw.)
Let's test this arithmetically:

In [31]:
J_epsilon = ((w+0.001)*x+b - y)**2/2
k = (J_epsilon - J)/0.001  
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_dw ~= k = {k} ")
J = 4.5, J_epsilon = 4.506002, dJ_dw ~= k = 6.001999999999619 

They match! Great. You can add $\frac{\partial J}{\partial w}$ to the diagram above and our analysis is complete.
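Putting the whole backward pass together, here is the basic method written down earlier applied node by node in plain Python (a sketch; the _num suffix just keeps these numeric values separate from the SymPy expressions used above):

x, w, b, y = 2, -2, 8, 1
c = w * x; a = c + b; d = a - y; J = d**2 / 2   # forward pass: c=-4, a=4, d=3, J=4.5

dJ_dd_num = d                 # node J = d**2/2 : local derivative is d
dJ_da_num = 1 * dJ_dd_num     # node d = a - y  : local derivative w.r.t. a is 1
dJ_dc_num = 1 * dJ_da_num     # node a = c + b  : local derivative w.r.t. c is 1
dJ_db_num = 1 * dJ_da_num     # node a = c + b  : local derivative w.r.t. b is 1
dJ_dw_num = x * dJ_dc_num     # node c = w * x  : local derivative w.r.t. w is x
print(dJ_dd_num, dJ_da_num, dJ_dc_num, dJ_db_num, dJ_dw_num)   # 3 3 3 3 6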

Congratulations!

You've worked through an example of back propagation using a computation graph. You can apply this to larger examples by following the same node-by-node approach.
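As a closing cross-check (a sketch), you can also build $J(w,b)$ as a single expression and let SymPy differentiate it directly; the results match the node-by-node values:

from sympy import symbols, diff
sx, sw, sb, sy = symbols('x, w, b, y')
sJ_full = (sw*sx + sb - sy)**2 / 2
vals = [(sx, 2), (sw, -2), (sb, 8), (sy, 1)]
print(diff(sJ_full, sw).subs(vals))   # 6, matching dJ_dw
print(diff(sJ_full, sb).subs(vals))   # 3, matching dJ_db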
