Backpropagation
Derivation
Image from here
In order to run SGD, for a single observation \((x, y)\) we need to find the derivatives of the cost function \(C\) with respect to network weights \(W^{(i)}\).
First we write down the feed forward network:
\begin{eqnarray*} z^{(1)} &=& W^{(1)}x\\ a^{(1)} &=& f^{(1)}(z^{(1)}) = f^{(1)}(W^{(1)}x)\\ \\ z^{(2)} &=& W^{(2)}a^{(1)}\\ a^{(2)} &=& f^{(2)}(z^{(2)}) = f^{(2)}(W^{(2)}a^{(1)})\\ \\ z^{(3)} &=& W^{(3)}a^{(2)}\\ a^{(3)} &=& f^{(3)}(z^{(3)}) = f^{(3)}(W^{(3)}a^{(2)})\\ \end{eqnarray*}
There is another common formulation that uses \(z = Wx+b\) instead of just \(Wx\), which is equivalent to the above if in each input and hidden layer the last element of \(f^{(i)}\) is set to the constant function 1 instead of the sigmoid function. Below we will follow this setting.
The cost function is \(C(y, f^{(3)}(W^{(3)} f^{(2)}(W^{(2)} f^{(1)}(W^{(1)} x ) ) ))\), and its derivative with respect to \(x\) is
each term of which is a total derivative. Recall that the total derivative of \(f: \mathbb R^m \rightarrow \mathbb R^n\) is defined as the linear map \(df_a\) such that
The matrix form of this linear map is of size \(n\times m\).
To understand terms of the form \(da^{(i)}/dz^{(i)}\) (the 2nd, 4th and 6th total derivatives on the right hand side) first note that \(f^{(i)}\) is an \(\mathbb R^n\rightarrow\mathbb R^n\) function, \(n\) the number of layers in the \(i\)th layer (layer is zero indexed). For example for the output layer \(f^{(3)}\) is \(\mathbb R^4\rightarrow\mathbb R^4\). Thus the total derivative \(da^{(3)}/dz^{(3)}\) is of size \(4\times 4\). Since the activation function is applied element by element, the \(z^{(3)}_i\) has no impact on \(a^{(3)}_j\) if \(i\neq j\). Thus the total derivative is a diagonal matrix
where \(f_j^{(3)}\) is the element function in \(f^{(3)}\). From now on we denote this diagonal matrix by \((f^{(3)})^\prime\) for simplicity. The hidden layers have different total derivatives. Since the last element functions are set to a constant 1, the last element of the total derivative matrix is 0, that is
and same for \(da^{(1)}/dz^{(1)}\). We denote these matrices by \((f^{(2)})^\prime\) and \((f^{(1)})^\prime\).
Terms of the form \(dz^{(i)}/da^{(i-1)}\) (the 3rd, 5th, and 7th terms on the right hand side) is obvious: It is simply \(W^{(i)}\), and the complete total derivative can be written as
\begin{eqnarray*} \frac{dC}{dx} &=& \frac{dC}{da^{(3)}} \cdot \frac{da^{(3)}}{dz^{(3)}} \cdot \frac{dz^{(3)}}{da^{(2)}} \cdot \frac{da^{(2)}}{dz^{(2)}} \cdot \frac{dz^{(2)}}{da^{(1)}} \cdot \frac{da^{(1)}}{dz^{(1)}} \cdot \frac{dz^{(1)}}{dx}\\\\ &=& \frac{dC}{da^{(3)}} \cdot (f^{(3)})^\prime \cdot W^{(3)} \cdot (f^{(2)})^\prime \cdot W^{(2)} \cdot (f^{(1)})^\prime \cdot W^{(1)}. \end{eqnarray*}
We introduce the following notation to simplify the argument below:
\begin{eqnarray*} {\delta^{(3)}}^T &=& \frac{dC}{dz^{(3)}} = \frac{dC}{da^{(3)}} \cdot (f^{(3)})^\prime, \\ {\delta^{(2)}}^T &=& \frac{dC}{dz^{(2)}} = \frac{dC}{da^{(3)}} \cdot (f^{(3)})^\prime \cdot W^{(3)} \cdot (f^{(2)})^\prime, \\ {\delta^{(1)}}^T &=& \frac{dC}{dz^{(1)}} = \frac{dC}{da^{(3)}} \cdot (f^{(3)})^\prime \cdot W^{(3)} \cdot (f^{(2)})^\prime \cdot W^{(2)} \cdot (f^{(1)})^\prime. \end{eqnarray*}
Those are row vectors, and we definite \(\delta^{(i)}\) as column vectors.
To perform SGD, what we really want to find is terms like
This notation is inconsistent with the previous ones because \(W^{(i)}\) is a matrix. Let us flatten it as \(\bar W^{(i)}\) (in the same order like the numpy flatten function, but treat it as a column vector, or \(\bar W^{(i)} = \text{vec}\left({W^{(i)}}^T\right)\) formally) and write
We know how to compute \(\delta^{(i)}\) from the above discussion. To find \(dz^{(i)}/d\bar W^{(i)}\), take \(i=3\) as an example, note that \(z^{(3)} = W^{(3)}_{4\times 7}a^{(2)}\). For simplicity we omit the superscripts and write \(z = Wa\), or equivalently, \(z^T = a^TW^T\). Matrix vectorization has a property that \(\text{vec}(AB) = (I_m \otimes A)\text{vec}(B)\), where \(\otimes\) stands for the Kronecker product and \(m\) is the number of columns in \(B\). Using this property we can write
and conclude that
\begin{eqnarray*} \frac{dz^{(3)}}{d\bar W^{(3)}} &=& I_4\otimes \left(a^{(2)}\right)^T, \\ \frac{dC}{d\bar W^{(3)}} &=& {\delta^{(3)}}^T\cdot\left(I_4\otimes \left(a^{(2)}\right)^T\right)\\ &=& (\delta^{(3)}_1 \delta^{(3)}_2 \delta^{(3)}_3 \delta^{(3)}_4)_{1\times 4} \begin{pmatrix} \left(a^{(2)}\right)^T &&& 0\\ &\left(a^{(2)}\right)^T &&\\ &&\left(a^{(2)}\right)^T &\\ 0&&&\left(a^{(2)}\right)^T \end{pmatrix}_{4\times 28}. \end{eqnarray*}
This total derivative is a matrix of size \(1\times 28\), each element being the derivative of \(C\) with respect to one of the 28 elements of \(W^{(3)}\) which can be used for SGD. To write it down as a more compact formula, stack this back to matrix form to get
which is a \(4\times 7\) matrix, same size as \(W^{(3)}\).
To summarize in terms of gradient \(\nabla\), note that the total derivative \(dC/da^{(i)}\), a row vector, is the transpose of the gradient form \(\nabla_{a^{(i)}} C\). We have
\begin{eqnarray*} \delta^{(3)} &=& \nabla_{z^{(3)}} C = (f^{(3)})^\prime \cdot \nabla_{a^{(3)}} C, \\ \delta^{(2)} &=& \nabla_{z^{(2)}} C = (f^{(2)})^\prime \cdot \left(W^{(3)}\right)^T \cdot (f^{(3)})^\prime \cdot \nabla_{a^{(3)}} C, \\ \delta^{(1)} &=& \nabla_{z^{(1)}} C = (f^{(1)})^\prime \cdot \left(W^{(2)}\right)^T \cdot (f^{(2)})^\prime \cdot \left(W^{(3)}\right)^T \cdot (f^{(3)})^\prime \cdot \nabla_{a^{(3)}} C, \end{eqnarray*}
or as recursive formulas
\begin{eqnarray*} \delta^{(3)} &=& (f^{(3)})^\prime \cdot \nabla_{a^{(3)}} C, \\ \delta^{(2)} &=& (f^{(2)})^\prime \cdot \left(W^{(3)}\right)^T \cdot \delta^{(3)}, \\ \delta^{(1)} &=& (f^{(1)})^\prime \cdot \left(W^{(2)}\right)^T \cdot \delta^{(2)}, \end{eqnarray*}
and
These can be used to implement SGD.
Discussion
Any change in the last column of \(W^{(1)}\) does not have any impact to the final output \(a^{(3)}\) due to the last constant activation function 1, so may need to exclude before optimizing the network weights. Same for \(W^{(2)}\) but not \(W^{(3)}\). Or is it ok to just include them?
GD 的 cost function 如果用所有 data point 的 cost function 取平均,那 \(dC/dW^{(i)}\) 矩陣也是跑遍所有 data point 取平均(因為微分是線性的)
ANN 也是一種 regression。 Universal Approximation Theorem 說所有函數都可以表示成 \(g(x) = W_2f(W_1x + b_1)\),所以給定 data points \(\{(x_j, y_j)\}_{j=1}^m\),如果用訓練誤差的平方和 \(\sum(y_j - g(x_j; \theta))^2\) 作為 cost function,等同於對這個模型用 MLE 定參:\(y = g(x; \theta) + \epsilon\),這裡 \(\epsilon\) 是 iid normal。線性迴歸就只是特殊情況 \(g(x; a, b) = ax+b\)
最簡單的 ANN cost function 不是權重的 convex function 因為所有 neuron 重新排列組合又得到一樣的 ANN
Implementation
Output layer 用 identity function 作為 activation,所以 output 就直接是最後一層隱藏層 activation 的 weighted sum 而已。除了 output layer 之外每一層的最後一個 neuron activation 都被設成 1,代表常數項。
[3]:
import copy, time
import numpy as np
from pprint import pprint
from pandas import DataFrame
from scipy import optimize
sigmoid = lambda x: 1./(1+np.exp(-x))
batch = 10
# learn times table
data = DataFrame([[i, j, i*j] for i in range(11) for j in range(11) for k in range(batch)], columns=['x1', 'x2', 'y1'])
nNeuron = [5] # list of int, each is the number of neurons in a layer. Only has hidden layers here; number of input and output variables are figured out by data and later on prepended and appended.
x = data[[colName for colName in data.columns if 'x' in colName]] # training data -- input
y = data[[colName for colName in data.columns if 'y' in colName]] # training data -- output
W = []
nNeuron = [len(x.columns) + 1] + nNeuron + [len(y.columns)]
for i in range(1, len(nNeuron)):
n = nNeuron[i-1]
m = nNeuron[i]
W.append(np.random.normal(loc=0, scale=np.sqrt(2/(m + n)), size=(m, n)))
wShape = [mat.shape for mat in W]
def flatten(W):
return np.concatenate([mat.flatten() for mat in W])
def stack(flatW):
if not isinstance(flatW, np.ndarray):
flatW = np.array(flatW)
W = []
for m, n in wShape:
W.append(flatW[:m*n].reshape(m, n))
flatW = flatW[m*n:]
return W
def costFnc(W):
cost = 0
for j in data.index:
yj = y.iloc[j].values
xj = x.iloc[j].values
a = np.append(xj, 1)
for i, mat in enumerate(W):
z = mat @ a
if i != len(W)-1:
a = sigmoid(z)
a[-1] = 1
else:
a = z
cost += 0.5*sum((yj - a)**2)/len(data)
return cost
def costFncPrime(W):
derivative = [np.zeros(shape) for shape in wShape]
for j in data.index:
yj = y.iloc[j].values
xj = x.iloc[j].values
a = [np.append(xj, 1)]
z = [[]]
fp = [[]]
for i, mat in enumerate(W):
a[-1][-1] = 1
z.append(mat @ a[-1])
a.append(sigmoid(z[-1]) if i != len(W)-1 else z[-1])
if i != len(W)-1:
thisFp = sigmoid(z[-1])*(1 - sigmoid(z[-1]))
thisFp[-1] = 0
else:
thisFp = np.ones_like(z[-1])
fp.append(thisFp)
delta = [-fp[-1]*(yj - a[-1])]
for i in range(len(W), 1, -1):
delta = [fp[i-1]*(W[i-1].T @ delta[0])] + delta
for i, (thisDelta, thisA) in enumerate(zip(delta, a[:-1])):
derivative[i] += np.outer(thisDelta, thisA)/len(data)
return derivative
def costFncPrimeN(W, h=0.0001):
flatW = flatten(W)
derivative = []
for i, _ in enumerate(flatW):
wPlus = copy.deepcopy(flatW)
wMinus = copy.deepcopy(flatW)
wPlus[i] += h
wMinus[i] -= h
derivative.append((costFnc(stack(wPlus)) - costFnc(stack(wMinus)))/(2*h))
return stack(derivative)
### Gradient checking passed!
# df = DataFrame({'analytical': flatten(costFncPrime(W)), 'numerical': flatten(costFncPrimeN(W))})
# df.plot()
start_time = time.time()
res = optimize.fmin_l_bfgs_b(func=lambda flatW: costFnc(stack(flatW)),
x0=flatten(W),
fprime=lambda flatW: flatten(costFncPrime(stack(flatW)))
)
print("--- %s seconds ---" % (time.time() - start_time))
res
--- 3212.470108270645 seconds ---
[3]:
(array([-4.21929750e-02, -2.62976844e-01, 3.35119940e+00, -2.59658129e+01,
1.64540644e+02, -1.45968095e+02, 1.02712137e+02, 1.20333637e+02,
-6.87671031e+01, -7.74823574e-02, 3.96491065e-01, -3.55297051e+00,
-1.02911436e+00, -2.03259965e-01, -4.99736329e-01, -5.06556070e+02,
1.82845136e+00, -5.83279046e+00, -2.48325920e+02, 4.96427975e+02]),
1.0022033220456583,
{'grad': array([ 1.21769259e-02, 6.16608272e-03, 2.10140359e-03, 9.51934271e-04,
2.49551044e-04, 1.09074456e-04, 3.86159665e-16, 0.00000000e+00,
3.86159665e-16, 4.87565906e-03, 6.50257212e-05, 2.38231386e-03,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -7.71329909e-04,
-5.04626550e-06, -9.72322665e-04, 1.32452487e-03, -8.04457097e-05]),
'task': b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH',
'funcalls': 3691,
'nit': 3164,
'warnflag': 0})
[4]:
def ffnn(x, y):
W = stack(res[0])
a = np.array([x, y, 1])
for i, mat in enumerate(W):
z = mat @ a
if i != len(W)-1:
a = sigmoid(z)
a[-1] = 1
else:
a = z
return z[0]
DataFrame([[ffnn(x, y) for x in range(11)] for y in range(11)])
[4]:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.107917 | -4.509680 | -3.301543 | -2.097025 | -0.892798 | 0.314357 | 1.527558 | 2.749830 | 3.984110 | 5.233252 | 6.500034 |
| 1 | -2.189593 | -2.381000 | -0.760940 | 0.849954 | 2.454860 | 4.057948 | 5.663260 | 7.274708 | 8.896083 | 10.531060 | 12.183201 |
| 2 | -0.559515 | 1.632356 | 3.798617 | 5.944748 | 8.076089 | 10.197838 | 12.315049 | 14.056946 | 14.726858 | 16.859262 | 19.005831 |
| 3 | 0.784550 | 3.696027 | 6.568141 | 9.407148 | 12.219228 | 15.010458 | 17.786798 | 20.554069 | 23.317937 | 26.083909 | 28.857316 |
| 4 | 1.583891 | 5.397368 | 9.156940 | 12.868835 | 16.539374 | 20.174916 | 23.781814 | 27.366370 | 30.934799 | 34.493196 | 38.047508 |
| 5 | 1.598094 | 6.489061 | 11.315741 | 16.082764 | 20.795152 | 25.458238 | 30.077590 | 34.658932 | 39.208086 | 43.730900 | 48.233199 |
| 6 | 0.746070 | 6.836950 | 12.865339 | 18.832020 | 24.738539 | 30.587120 | 36.380575 | 42.122208 | 47.815733 | 53.465177 | 59.074806 |
| 7 | -0.689791 | 6.609636 | 13.869138 | 21.083529 | 28.248617 | 35.361164 | 42.418840 | 49.420150 | 56.364371 | 63.251462 | 70.081983 |
| 8 | -1.882261 | 6.469999 | 14.828811 | 23.182501 | 31.520190 | 39.831864 | 48.108417 | 56.341679 | 64.524422 | 72.650344 | 80.714044 |
| 9 | -1.462079 | 7.619747 | 16.769931 | 25.972478 | 35.211501 | 44.471372 | 53.736878 | 62.993354 | 72.226801 | 81.423989 | 90.572537 |
| 10 | 2.169688 | 11.553753 | 21.063661 | 30.683236 | 40.395559 | 50.183140 | 60.028081 | 69.912259 | 79.817512 | 89.725809 | 99.619435 |