Intuition for backpropagation - numpy

Below is the forward pass and a partly implemented backward pass of backpropagation for a neural network:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X_train = np.asarray([[1, 1], [0, 0]]).T
Y_train = np.asarray([[1], [0]]).T
hidden_size = 2
output_size = 1
learning_rate = 0.1

# forward propagation
w1 = np.random.randn(hidden_size, 2) * 0.1
b1 = np.zeros((hidden_size, 1))
w2 = np.random.randn(output_size, hidden_size) * 0.1
b2 = np.zeros((output_size, 1))
Z1 = np.dot(w1, X_train) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(w2, A1) + b2
A2 = sigmoid(Z2)
derivativeA2 = A2 * (1 - A2)
derivativeA1 = A1 * (1 - A1)

# first steps of back propagation
error = (A2 - Y_train)
dA2 = error / derivativeA2
dZ2 = np.multiply(dA2, derivativeA2)
What is the intuition behind:
error = (A2 - Y_train)
dA2 = error / derivativeA2
dZ2 = np.multiply(dA2, derivativeA2)
I understand that error is the difference between the current prediction A2 and the actual values Y_train. But why divide this error by the derivative of A2, and then multiply the result of error / derivativeA2 by derivativeA2 again? What is the intuition behind this?

These expressions are indeed confusing:
derivativeA2 = A2 * (1 - A2)
error = (A2 - Y_train)
dA2 = error / derivativeA2
... because error doesn't have a meaning on its own. At this point, the goal is to compute the derivative of the cross-entropy loss, which has this formula:
dA2 = (A2 - Y_train) / (A2 * (1 - A2))
See these lecture notes (formula 6) for the derivation. It just happens that the previous operation is a sigmoid, whose derivative is A2 * (1 - A2). That's why this expression is used again to compute dZ2 (formula 7).
But if you had a different loss function (say, L2) or a different squashing activation, then A2 * (1 - A2) wouldn't be reused. These are different nodes in the computational graph.
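To make the cancellation concrete, here is a minimal standalone sketch with stand-in values: for a sigmoid output trained with cross-entropy, multiplying dA2 by derivativeA2 undoes the division, so dZ2 collapses to A2 - Y_train.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(0)
A2 = sigmoid(np.random.randn(1, 2))   # stand-in predictions
Y = np.asarray([[1, 0]])              # stand-in labels

derivativeA2 = A2 * (1 - A2)          # sigmoid'(Z2)
dA2 = (A2 - Y) / derivativeA2         # d(cross-entropy)/dA2 (formula 6)
dZ2 = dA2 * derivativeA2              # chain rule through the sigmoid (formula 7)
assert np.allclose(dZ2, A2 - Y)       # the two factors cancel exactly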

Related

"can't convert expression to float" error in integration

I have a third-order ODE with time-varying coefficients of the form
y''' + A1(t)y'' + A2(t)y' + A3(t)y = 0
where A1, A2, and A3 involve derivatives of another function
R(t) = a - b*atan(c*t - d)
I have written the code below, but I keep getting a "can't convert expression to float" error.
My Python skills are not strong.
I would appreciate advice on how to approach this.
Thanks
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from sympy import *

x = symbols('x')  # symbolic time variable (needed for atan and .diff below)

def du_dt(y, x):
    # y''' = -A1*y'' - A2*y' - A3*y
    return [y[1], y[2], -A1*y[2] - A2*y[1] - A3*y[0]]

V = 2.5e3
Ra = 7.5e3
C1 = 3.3e-6
RL = 1.10
L = 3.32e-6
C2 = 3.16e-12
Rd = 0.5
w1 = 1/(C1*L)
T1 = Ra*C1
C23 = C1*C2/(C1+C2)
w2 = 1/(L*C23)
R5 = 0.015 - 0.006*atan(x-5)
Tm = L/(RL+R5)
A1 = 1/Tm + 1/T1
A2 = (1/Tm).diff(x) + w2 + (1/L)*R5.diff(x) + 1/(T1*Tm)
A3 = (1/L)*R5.diff(x,2) + (1/(L*T1))*R5.diff(x) + (w2-w1)/T1
t = np.linspace(0, 0.2, 1000)  # time range
y0 = [3.0425e10, 0, 0]         # initial values
y = odeint(du_dt, y0, t)
# TypeError: can't convert expression to float
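The error arises because A1, A2, and A3 are sympy expressions, so each call to du_dt hands odeint symbolic values it cannot convert to floats. A minimal sketch of one way around this, assuming the coefficients are meant to be evaluated numerically at each time step: lambdify the symbolic expressions into plain numeric functions first.

import numpy as np
from scipy.integrate import odeint
import sympy as sp

t_sym = sp.symbols('t')

# Constants from the question.
Ra, C1, RL, L, C2 = 7.5e3, 3.3e-6, 1.10, 3.32e-6, 3.16e-12
w1 = 1/(C1*L)
T1 = Ra*C1
C23 = C1*C2/(C1+C2)
w2 = 1/(L*C23)

# Symbolic coefficients, as in the question.
R5 = 0.015 - 0.006*sp.atan(t_sym - 5)
Tm = L/(RL + R5)
A1_expr = 1/Tm + 1/T1
A2_expr = (1/Tm).diff(t_sym) + w2 + (1/L)*R5.diff(t_sym) + 1/(T1*Tm)
A3_expr = (1/L)*R5.diff(t_sym, 2) + (1/(L*T1))*R5.diff(t_sym) + (w2 - w1)/T1

# Convert each expression into a fast numeric function of t.
A1_f, A2_f, A3_f = (sp.lambdify(t_sym, e, 'numpy')
                    for e in (A1_expr, A2_expr, A3_expr))

def du_dt(y, t):
    # y''' = -A1(t)*y'' - A2(t)*y' - A3(t)*y
    return [y[1], y[2], -A1_f(t)*y[2] - A2_f(t)*y[1] - A3_f(t)*y[0]]

t = np.linspace(0, 0.2, 1000)
y = odeint(du_dt, [3.0425e10, 0, 0], t)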

Sequential sampling from conditional multivariate normal

I'm trying to sequentially sample from a Gaussian Process prior.
The problem is that the samples eventually converge to zero or diverge to infinity.
I'm using the basic conditionals described e.g. here
Note: the kernel(X,X) function returns the squared exponential kernel with isotropic noise.
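For reference, the standard conditional formulas being implemented here, with index 1 the new point and index 2 all previously sampled points, are:
u = u1 + C12 C22^{-1} (y2 - u2)
C = C11 - C12 C22^{-1} C21
and the new sample is drawn from N(u, C).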
Here is my code:
import numpy as np
from numpy import linalg as la

n = 32
x_grid = np.linspace(-5, 5, n)
x_all = []
y_all = []
for x in x_grid:
    x_all = [x] + x_all
    X = np.array(x_all).reshape(-1, 1)
    # Mean and covariance of the prior
    mu = np.zeros(X.shape)
    cov = kernel(X, X)  # squared exponential kernel, defined elsewhere
    if len(mu) == 1:  # first sample is not conditional
        y = np.random.randn()*cov + mu
    else:
        # condition on all previous samples
        u1 = mu[0]
        u2 = mu[1:]
        y2 = np.atleast_2d(np.array(y_all)).T
        C11 = cov[:1, :1]  # dependent sample
        C12 = np.atleast_2d(cov[0, 1:])
        C21 = np.atleast_2d(cov[1:, 0]).T
        C22 = np.atleast_2d(cov[1:, 1:])
        C22_ = la.inv(C22)
        u = u1 + np.dot(C12, np.dot(C22_, (y2 - u2)))
        C22_xC21 = np.dot(C22_, C21)
        C_minus = np.dot(C12, C22_xC21)  # this weirdly becomes larger than C!
        C = C11 - C_minus
        y = u + np.random.randn()*C
    y_all = [y.flatten()[0]] + y_all
Here's an example with 32 samples, where it collapses:
Here's an example with 34 samples, where it explodes:
(For this particular kernel, 34 or more samples is where the divergence starts.)
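For comparison, a minimal sketch of the same sequential scheme that stays stable. The key detail is that np.random.randn() must be scaled by a standard deviation, so the draw uses np.sqrt of the conditional variance; the code above multiplies by the variance (cov and C) directly, which is a plausible source of both the collapse and the blow-up. The rbf function below is an assumption standing in for the unposted kernel(X, X):

import numpy as np

def rbf(a, b, ell=1.0, sf=1.0, jitter=1e-8):
    # Squared-exponential kernel; small jitter on the diagonal for stability.
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    K = sf**2 * np.exp(-0.5 * (d / ell)**2)
    if a.shape == b.shape and np.allclose(a, b):
        K += jitter * np.eye(len(a))
    return K

rng = np.random.default_rng(0)
xs, ys = [], []
for x in np.linspace(-5, 5, 32):
    if not xs:
        var = rbf(np.array([x]), np.array([x]))[0, 0]
        y = rng.standard_normal() * np.sqrt(var)      # scale by std, not variance
    else:
        X_old = np.array(xs)
        k_star = rbf(np.array([x]), X_old)            # C12, shape (1, n)
        K = rbf(X_old, X_old)                         # C22
        mean = float(k_star @ np.linalg.solve(K, np.array(ys)))
        var = (rbf(np.array([x]), np.array([x]))[0, 0]
               - float(k_star @ np.linalg.solve(K, k_star.T)))
        y = mean + rng.standard_normal() * np.sqrt(max(var, 0.0))
    xs.append(x)
    ys.append(y)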

How to implement a weighted product of likelihoods for multiple random variables in pyMC3?

I need to build a super-likelihood function from several random variables. The distribution of each variable is standard.
Using two random variables as an example, the target super-likelihood is like this:
S = F1^w1 * F2^w2 (s.t. w1 + w2 = 1)
or equivalently,
log S = w1 log F1 + w2 log F2 (s.t. w1 + w2 = 1),
where F1 follows a Normal distribution and F2 follows a Bernoulli distribution.
I use the following code:
import pymc3 as pm

data = <load my data>
[w1, w2] = [0.5, 0.5]
with pm.Model() as model:
    mu = pm.Uniform('mu', lower=0, upper=1)
    sd = pm.Uniform('sd', lower=0, upper=1)
    p = pm.Uniform('p', lower=0, upper=1)
    F1 = pm.Normal("F1", mu=mu, sigma=sd)
    F2 = pm.Bernoulli("F2", p)
    S = pm.Deterministic('S', F1**w1 * F2**w2, observed=data)  # fails: Deterministic accepts no observed argument
    step = pm.Metropolis()
    trace = pm.sample(2000, step=step)
But it does not work.
Please help me implement such a weighted likelihood model in pyMC3.
Seems like you are looking for the Mixture function: https://docs.pymc.io/api/distributions/mixture.html#pymc3.distributions.mixture.Mixture
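If the goal is literally the weighted product of likelihoods (geometric pooling) rather than a mixture, a minimal hedged sketch uses pm.Potential to add the weighted log-likelihood terms to the joint log density. Here data1 and data2 are hypothetical observation arrays standing in for the unposted data:

import numpy as np
import pymc3 as pm

# Hypothetical observations for the two component likelihoods.
data1 = np.random.rand(100)               # modeled by the Normal term
data2 = np.random.binomial(1, 0.3, 100)   # modeled by the Bernoulli term
w1, w2 = 0.5, 0.5                         # fixed weights, w1 + w2 = 1

with pm.Model() as model:
    mu = pm.Uniform('mu', lower=0, upper=1)
    sd = pm.Uniform('sd', lower=0, upper=1)
    p = pm.Uniform('p', lower=0, upper=1)
    # log S = w1*log F1 + w2*log F2, added directly to the model's log density.
    logS = (w1 * pm.Normal.dist(mu=mu, sigma=sd).logp(data1).sum()
            + w2 * pm.Bernoulli.dist(p=p).logp(data2).sum())
    pm.Potential('S', logS)
    trace = pm.sample(2000)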

6th-order polynomial regression, no result

I just learnt about TensorFlow. To become more familiar with the syntax, I built a toy model to perform polynomial regression.
The toy dataset that I created is
x_data = np.linspace(-1, 1, 300) + np.random.uniform(-0.05, 0.05, 300)
y_data = np.linspace(-1, 1, 300) ** 2 + np.random.uniform(-0.05, 0.05, 300)
The model that I built is
import numpy as np
import tensorflow as tf

batch_size = 20
x = tf.placeholder(tf.float64, [1, batch_size])
y = tf.placeholder(tf.float64, [1, batch_size])
a0 = tf.Variable(np.random.rand(1))
a1 = tf.Variable(np.random.rand(1))
a2 = tf.Variable(np.random.rand(1))
a3 = tf.Variable(np.random.rand(1))
a4 = tf.Variable(np.random.rand(1))
a5 = tf.Variable(np.random.rand(1))
a6 = tf.Variable(np.random.rand(1))
op = a6 * x ** 6 + a5 * x ** 5 + a4 * x ** 4 + a3 * x ** 3 + a2 * x ** 2 + a1 * x ** 1 + a0
error = tf.reduce_sum(tf.square(op - y))
init = tf.global_variables_initializer()
optimizer = tf.train.GradientDescentOptimizer(0.0001)
train = optimizer.minimize(error)
sess = tf.Session()
steps = 100000
sess.run(init)
for i in range(steps):
    rand_int = np.random.randint(0, 300, batch_size)
    x_temp = x_data[rand_int].reshape(1, batch_size)
    y_temp = y_data[rand_int].reshape(1, batch_size)
    feed = {x: x_temp, y: y_temp}
    sess.run(train, feed)
a0, a1, a2, a3, a4, a5, a6 = sess.run([a0, a1, a2, a3, a4, a5, a6])
However, after I run the model, the result that I got is:
[a0, a1, a2, a3, a4, a5, a6] = [array([ nan]), array([ nan]), array([ nan]), array([ nan]), array([ nan]), array([ nan]), array([ nan])]
Why did the model learn nothing? I've changed the learning rate to be an order of magnitude smaller, yet the outcome is still the same.
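No accepted answer appears here, but one plausible culprit, offered as a guess: with reduce_sum the loss and its gradients grow with the batch, and plain gradient descent on raw polynomial features can overshoot into nan. A minimal sketch of a more stable variant under that assumption, using the same TF 1.x API but reduce_mean and Adam:

import numpy as np
import tensorflow as tf

# Toy data, as in the question.
x_data = np.linspace(-1, 1, 300) + np.random.uniform(-0.05, 0.05, 300)
y_data = np.linspace(-1, 1, 300) ** 2 + np.random.uniform(-0.05, 0.05, 300)

batch_size = 20
x = tf.placeholder(tf.float64, [1, batch_size])
y = tf.placeholder(tf.float64, [1, batch_size])
coeffs = [tf.Variable(np.random.rand(1)) for _ in range(7)]  # a0..a6
op = sum(c * x ** i for i, c in enumerate(coeffs))
# Mean instead of sum keeps the loss scale independent of batch size.
loss = tf.reduce_mean(tf.square(op - y))
train = tf.train.AdamOptimizer(1e-2).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        idx = np.random.randint(0, 300, batch_size)
        feed = {x: x_data[idx].reshape(1, batch_size),
                y: y_data[idx].reshape(1, batch_size)}
        sess.run(train, feed)
    print(sess.run(coeffs))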

Tensorflow - tf.matmul of conv features and a vector as a batch matmul

I tried the following code
batch_size = 128
c1 = tf.zeros([128, 32, 32, 16])
c2 = tf.zeros([128, 32, 32, 16])
c3 = tf.zeros([128, 32, 32, 16])
c = tf.stack([c1, c2, c3], 4)  # shape: [128, 32, 32, 16, 3]
alpha = tf.zeros([128, 3, 1])
M = tf.matmul(c, alpha)
This raises an error at tf.matmul.
What I want is just the linear combination alpha[0]*c1 + alpha[1]*c2 + alpha[2]*c3 for each sample. When the batch size is 1 this code is fine, but when it is not, how can I do it?
Should I reshape c1, c2, c3?
I think this code works; I verified it.
import tensorflow as tf
import numpy as np

batch_size = 128
c1 = tf.ones([128, 32, 32, 16])
c2 = tf.ones([128, 32, 32, 16])
c3 = tf.ones([128, 32, 32, 16])
c = tf.stack([c1, c2, c3], 4)  # [128, 32, 32, 16, 3]

# Build a demo alpha of shape [128, 3], one weight vector per sample.
alpha = tf.zeros([1, 3])
for j in range(127):
    z = alpha[j] + 1
    z = tf.expand_dims(z, 0)
    alpha = tf.concat([alpha, z], 0)

# Contract the stacked feature maps against the per-sample weights.
M = tf.einsum('aijkl,al->aijk', c, alpha)

with tf.Session() as sess:
    _alpha = sess.run(alpha)
    _M = sess.run(M)
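As a follow-up, a hedged sketch of two alternatives that keep the asker's original alpha shape of [128, 3, 1] instead of rebuilding it in a loop:

import tensorflow as tf

c1 = tf.ones([128, 32, 32, 16])
c2 = tf.ones([128, 32, 32, 16])
c3 = tf.ones([128, 32, 32, 16])
c = tf.stack([c1, c2, c3], 4)      # [128, 32, 32, 16, 3]
alpha = tf.zeros([128, 3, 1])      # the asker's shape

# Option 1: drop the trailing singleton axis and contract with einsum.
M1 = tf.einsum('bhwcs,bs->bhwc', c, tf.squeeze(alpha, -1))

# Option 2: flatten the spatial axes so tf.matmul batches over samples.
c_flat = tf.reshape(c, [128, 32 * 32 * 16, 3])
M2 = tf.reshape(tf.matmul(c_flat, alpha), [128, 32, 32, 16])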