I am working on parameterizing B-spline knots and control points with deep learning. To do this, I have a model that outputs the knots and control points, and I want the model to learn via a custom loss function that evaluates the B-spline and computes the mean squared error against the actual curve. The hope is that the model learns to output the correct knots and control points.
However, a Keras custom loss function requires all statements to be expressed with backend functions. From the scipy B-spline docs, this is how the function looks normally:
def B(x, k, i, t):
    if k == 0:
        return 1.0 if t[i] <= x < t[i+1] else 0.0
    if t[i+k] == t[i]:
        c1 = 0.0
    else:
        c1 = (x - t[i])/(t[i+k] - t[i]) * B(x, k-1, i, t)
    if t[i+k+1] == t[i+1]:
        c2 = 0.0
    else:
        c2 = (t[i+k+1] - x)/(t[i+k+1] - t[i+1]) * B(x, k-1, i+1, t)
    return c1 + c2

def bspline(x, t, c, k):
    n = len(t) - k - 1
    assert (n >= k+1) and (len(c) >= n)
    return sum(c[i] * B(x, k, i, t) for i in range(n))
I know that for simple statements such as summation I can use K.sum, and for if-else statements I can use K.switch. However, I do not know how to implement for-loops in Keras. Is there a straightforward way to compute B-spline curves in Keras, i.e. to convert the above code into Keras backend operations?
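For reference, here is a rough sketch of the direction I am considering (untested; B_tf and bspline_tf are my own names, and it uses tf ops directly, which tf.keras custom losses accept when TensorFlow is the backend). Since the degree k and the number of basis functions are plain Python integers, the recursion and the for-loop simply unroll at graph-construction time; only the data-dependent branches need tensor ops, with tf.math.divide_no_nan encoding the 0/0 = 0 convention of the if/else above:

import tensorflow as tf

def B_tf(x, k, i, t):
    # k, i are plain Python ints; t is a 1-D tensor of knots (model output).
    if k == 0:
        return tf.where((t[i] <= x) & (x < t[i + 1]),
                        tf.ones_like(x), tf.zeros_like(x))
    # divide_no_nan returns 0 where the denominator is 0,
    # matching the c1 = 0.0 / c2 = 0.0 branches of the scipy version.
    c1 = tf.math.divide_no_nan(x - t[i], t[i + k] - t[i]) * B_tf(x, k - 1, i, t)
    c2 = tf.math.divide_no_nan(t[i + k + 1] - x,
                               t[i + k + 1] - t[i + 1]) * B_tf(x, k - 1, i + 1, t)
    return c1 + c2

def bspline_tf(x, t, c, k):
    # Assumes the number of knots is statically known, so n is a Python int
    # and the loop unrolls during graph construction.
    n = int(t.shape[0]) - k - 1
    return tf.add_n([c[i] * B_tf(x, k, i, t) for i in range(n)])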
Extremely new to Julia, so please pardon any obvious oversights on my end.
I am trying to estimate a piecewise likelihood function through optimization. I have the code working in R, but have begun translating it to Julia in the hope of faster estimation, for eventual bootstrapping.
Here is the current block of code that I am trying (v and x are 1000×1 vectors defined elsewhere):
function est(a,b)
    function pwll(v,x)
        if v>4
            ILL=pdf(Poisson(exp(a+b*x)), v)
        elseif v==4
            ILL=pdf(Poisson(exp(a+b*x)), 4)+pdf(Poisson(exp(a+b*x)),3)+pdf(Poisson(exp(a+b*x)),2)
        else v==0
            ILL=pdf(Poisson(exp(a+b*x)), 1)+pdf(Poisson(exp(a+b*x)), 0)
        end
        return(ILL)
    end
    ILL=pwll.(v, x)
    function fixILL(x)
        if x==0
            x=0.00000000000000001
        else
            x=x
        end
    end
    ILL=fixILL.(ILL)
    LILL=log10.(ILL)
    LL=-1*LILL
    return(sum(LL))
end
using Optim
params0=[1,1]
optimize(est, params0)
And the error message(s) I am getting are:
ERROR: InexactError: Int64(NaN)
Stacktrace:
  [1] Int64(x::Float64)
    @ Base ./float.jl:788
  [2] x_of_nans(x::Vector{Int64}, Tf::Type{Int64}) (repeats 2 times)
    @ NLSolversBase ~/.julia/packages/NLSolversBase/kavn7/src/NLSolversBase.jl:60
  [3] NonDifferentiable(f::Function, x::Vector{Int64}, F::Int64; inplace::Bool)
    @ NLSolversBase ~/.julia/packages/NLSolversBase/kavn7/src/objective_types/nondifferentiable.jl:11
  [4] NonDifferentiable(f::Function, x::Vector{Int64}, F::Int64)
    @ NLSolversBase ~/.julia/packages/NLSolversBase/kavn7/src/objective_types/nondifferentiable.jl:10
  [5] promote_objtype(method::NelderMead{Optim.AffineSimplexer, Optim.AdaptiveParameters}, x::Vector{Int64}, autodiff::Symbol, inplace::Bool, args::Function)
    @ Optim ~/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/interface.jl:63
  [6] optimize(f::Function, initial_x::Vector{Int64}; inplace::Bool, autodiff::Symbol, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Optim ~/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/interface.jl:86
  [7] optimize(f::Function, initial_x::Vector{Int64})
    @ Optim ~/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/interface.jl:83
  [8] top-level scope
    @ ~/Documents/Projects/ki_new/peicewise_ll.jl:120
I understand that the error seems to be coming from the function to be optimized being non-differentiable. A fairly direct translation works well in R, using the built-in optim() function.
Can anyone provide any insight?
I have tried the code displayed above with multiple variations. The function to be optimized works; I am struggling with the optimization itself (the issues may stem from the function being inefficiently written).
Here's an adapted version of your code which produces a solution:
using Distributions, Optim

function pwll(v, x, a, b)
    d = Poisson(exp(a+b*x))
    if v > 4
        return pdf(d, v)
    elseif v == 4
        return pdf(d, 4) + pdf(d, 3) + pdf(d, 2)
    else
        return pdf(d, 1) + pdf(d, 0)
    end
end

fixILL(x) = iszero(x) ? 1e-17 : x

est(a, b, v, x) = sum(-1 .* log10.(fixILL.(pwll.(v, x, a, b))))

v = 4; x = 0.5 # Defining these here as they are not given in your post

obj(input; v = v, x = x) = est(input[1], input[2], v, x)

optimize(obj, [1.0, 1.0])
I have no idea whether this is correct of course; check it against some sort of known result if you can. One thing to note: the starting point here is [1.0, 1.0] rather than your params0 = [1, 1]. The InexactError: Int64(NaN) in your stack trace comes from handing Optim a Vector{Int64} of initial parameters, which it cannot fill with NaN sentinels.
Based on the example quoted on TensorFlow's website here: https://www.tensorflow.org/api_docs/python/tf/custom_gradient
@tf.custom_gradient
def op_with_fused_backprop(x):
    y, x_grad = fused_op(x)
    def first_order_gradient(dy):
        @tf.custom_gradient
        def first_order_custom(unused_x):
            def second_order_and_transpose(ddy):
                return second_order_for_x(...), gradient_wrt_dy(...)
            return x_grad, second_order_and_transpose
        return dy * first_order_custom(x)
    return y, first_order_gradient
There is a lack of detail on why second_order_and_transpose(ddy) returns two objects. Based on the documentation of tf.custom_gradient, the grad_fn (i.e. second_order_and_transpose()) should return a list of Tensors: the derivatives of dy w.r.t. unused_x. It is also not clear why they named it unused_x. Does anyone have any idea about this example, or about creating custom gradients for higher-order derivatives in general?
There is a lack of detail on why second_order_and_transpose(ddy) returns two objects.
Based on my experiments with a few examples, I believe you are correct. The official doc is somewhat ambiguous (or incorrect): second_order_and_transpose(ddy) should return only one object, the calculated second-order gradient.
It is also not clear why they named it unused_x.
That is the tricky part. The name explains itself: you are never going to use it. The goal here is to wrap your second-order calculation in a function called first_order_custom. You take the gradient of x already computed by fused_op and return that, instead of anything derived from unused_x.
To make this clearer, here is an example extended from the official documentation that defines a second-order gradient for log1pexp:
NOTE: The second-order gradient is not numerically stable, so let's use (1 - tf.exp(x)) to represent it, just to make our life easier.
@tf.custom_gradient
def log1pexp2(x):
    e = tf.exp(x)
    y = tf.math.log(1 + e)
    x_grad = 1 - 1 / (1 + e)
    def first_order_gradient(dy):
        @tf.custom_gradient
        def first_order_custom(unused_x):
            def second_order_gradient(ddy):
                # Let's define the second-order gradient to be (1 - e)
                return ddy * (1 - e)
            return x_grad, second_order_gradient
        return dy * first_order_custom(x)
    return y, first_order_gradient
To test the script, simply run:
import tensorflow as tf

@tf.custom_gradient
def log1pexp2(x):
    e = tf.exp(x)
    y = tf.math.log(1 + e)
    x_grad = 1 - 1 / (1 + e)
    def first_order_gradient(dy):
        @tf.custom_gradient
        def first_order_custom(unused_x):
            def second_order_gradient(ddy):
                # Let's define the second-order gradient to be (1 - e)
                return ddy * (1 - e)
            return x_grad, second_order_gradient
        return dy * first_order_custom(x)
    return y, first_order_gradient
x1 = tf.constant(1.)
y1 = log1pexp2(x1)
dy1 = tf.gradients(y1, x1)
ddy1 = tf.gradients(dy1, x1)
x2 = tf.constant(100.)
y2 = log1pexp2(x2)
dy2 = tf.gradients(y2, x2)
ddy2 = tf.gradients(dy2, x2)
with tf.Session() as sess:
    print('x=1, dy1:', dy1[0].eval(session=sess))
    print('x=1, ddy1:', ddy1[0].eval(session=sess))
    print('x=100, dy2:', dy2[0].eval(session=sess))
    print('x=100, ddy2:', ddy2[0].eval(session=sess))
Result:
x=1, dy1: 0.7310586
x=1, ddy1: -1.7182817
x=100, dy2: 1.0
x=100, ddy2: -inf
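The -inf for ddy2 at x=100 is the instability flagged in the note above: exp(100.0) overflows float32 to inf, so (1 - e) evaluates to -inf. A quick check with plain numpy, just to illustrate the overflow:

import numpy as np
print(np.float32(1.0) - np.exp(np.float32(100.0)))  # -inf: exp(100) exceeds float32 max (~3.4e38)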
I'm trying to build an LSTM in TensorFlow 2.1 from scratch, without using the one already supplied with Keras (tf.keras.layers.LSTM), just to learn and code something. To do so, I've defined a class "Model" that, when called (as model(input)), computes the matrix multiplications of the LSTM. I'm pasting part of my code here; the other parts are on GitHub (link).
class Model(object):
    [...]
    def __call__(self, inputs):
        assert inputs.shape == (vocab_size, T_steps)
        outputs = []
        for time_step in range(T_steps):
            x = inputs[:, time_step]
            x = tf.expand_dims(x, axis=1)
            z = tf.concat([self.h_prev, x], axis=0)
            # Gates: forget, input, output
            f = tf.matmul(self.W_f, z) + self.b_f
            f = tf.sigmoid(f)
            i = tf.matmul(self.W_i, z) + self.b_i
            i = tf.sigmoid(i)
            o = tf.matmul(self.W_o, z) + self.b_o
            o = tf.sigmoid(o)
            # Candidate cell state
            C_bar = tf.matmul(self.W_C, z) + self.b_C
            C_bar = tf.tanh(C_bar)
            # New cell and hidden states
            C = (f * self.C_prev) + (i * C_bar)
            h = o * tf.tanh(C)
            # Output projection and softmax over the vocabulary
            v = tf.matmul(self.W_v, h) + self.b_v
            v = tf.sigmoid(v)
            y = tf.math.softmax(v, axis=0)
            self.h_prev = h
            self.C_prev = C
            outputs.append(y)
        outputs = tf.squeeze(tf.stack(outputs, axis=1))
        return outputs
But this neural network has three problems:
1) It is very slow during training. In comparison, a model that uses tf.keras.layers.LSTM() trains more than 10 times faster. Why is this? Maybe because I didn't use minibatch training, but stochastic training?
2) The NN seems not to learn anything at all. After just a few training examples, the loss settles and won't decrease any further; it just oscillates around the reached value. After training, I tested the NN by making it generate some text, but it only outputs nonsense gibberish. Why isn't it learning anything?
3) The loss function outputs very high values. I've coded a categorical cross-entropy loss function, but with a 100-character-long sequence the value of the function is over 370 per training example. Shouldn't it be much lower than this?
I've written the loss function like this:
def compute_loss(predictions, desired_outputs):
    l = 0
    for i in range(T_steps):
        # accumulate the negative log-probability of the correct character at step i
        l -= tf.math.log(predictions[desired_outputs[i], i])
    return l
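For scale, a quick back-of-the-envelope number: if the network predicted a uniform distribution at every step, this loss would come out to T * ln(V) for vocabulary size V. For example (V = 40 here is just a made-up illustration, not my actual vocabulary size):

import math
V = 40   # hypothetical vocabulary size
T = 100  # sequence length
print(T * math.log(V))  # about 368.9: total loss for uniform predictions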
I know these are open questions, but unfortunately I can't make it work. So any answer, even a short one that helps me solve the problem myself, is fine :)
I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digits images dataset. The implementation is based on the notation given here. Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
        cost : scalar representing the overall cost J(theta)
        grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]

    # unroll theta to get (W1,W2,b1,b2) #
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size,visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size,hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]

    # feedforward pass
    a_l1 = data
    z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
    a_l3 = sigmoid(z_l3)

    # backprop
    delta_l3 = numpy.multiply(-(data-a_l3), numpy.multiply(a_l3, 1-a_l3))
    delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
    b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
    W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
    W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)

    grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
    cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost, grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
                    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0, theta.size):
        eps_vector[i] = epsilon
        cost1, grad1 = J(theta + eps_vector)
        cost2, grad2 = J(theta - eps_vector)
        gradient[i] = (cost1 - cost2) / (2 * epsilon)
        eps_vector[i] = 0
    return gradient
The norm of the difference between the numerical estimate and the gradient computed by my function is around 6.87165125021e-09, which seems acceptable. My main problem is getting the "L-BFGS-B" algorithm to work via the scipy.optimize.minimize function, as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
fun: 90.802022224079778
hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
nfev: 21
nit: 0
status: 2
success: False
x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
0. , 0. ])
Now, this post seems to indicate that the error could mean the gradient function implementation is wrong, but my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem persists. Is there anything wrong with my backprop implementation?
Turns out the issue was a very silly bug on this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
The lambda parameter x is never used in the body, so the optimizer's updated parameter vector was never being passed to the function; every call to J just re-evaluated the original theta.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
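As a side note, this class of bug can be avoided altogether by letting scipy.optimize.minimize forward the fixed arguments itself via its args parameter, instead of closing over them in a lambda. A sketch using the same names as above:

import scipy.optimize

result = scipy.optimize.minimize(
    utils.autoencoder_cost_and_grad,  # called as f(x, *args); x is the current theta
    theta,                            # initial parameter vector
    args=(visible_size, hidden_size, lambda_, patches_train),
    method='L-BFGS-B', jac=True, options=options_)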
I have a loss function I would like to try and minimize:
import numpy as np

def lossfunction(X, b, lambs):
    B = b.reshape(X.shape)
    penalty = np.linalg.norm(B, axis=1)**(0.5)
    return np.linalg.norm(np.dot(X, B) - X) + lambs*penalty.sum()
Gradient descent, or similar methods, might be useful. I can't calculate the gradient of this function analytically, so I am wondering how I can numerically calculate the gradient for this loss function in order to implement a descent method.
Numpy has a gradient function, but it requires me to pass a scalar field at predetermined points.
You could try scipy.optimize.minimize
For your case a sample call would be:
import scipy.optimize

# x0: 1-D array holding the initial guess for the variable being optimized
# (the first argument of lossfunction)
result = scipy.optimize.minimize(lossfunction, x0, args=(b, lambs), method='Nelder-Mead')
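If you later want a gradient-based method instead of the derivative-free Nelder-Mead, SciPy can also build a forward-difference gradient estimate for you with scipy.optimize.approx_fprime. A sketch, reusing x0 from above as the point of evaluation (eps is a step size you may need to tune):

from scipy.optimize import approx_fprime

eps = 1e-6  # finite-difference step
grad = approx_fprime(x0, lossfunction, eps, b, lambs)  # gradient w.r.t. the first argument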
You could estimate the derivative numerically by a central difference:
def derivative(fun, X, b, lambs, h):
    return (fun(X + 0.5*h, b, lambs) - fun(X - 0.5*h, b, lambs)) / h
And use it like this:
# assign values to X, b, lambs
# set the value of h
h = 0.001
print(derivative(lossfunction, X, b, lambs, h))
The code above is valid for dim X = 1; some modifications are needed to account for a multidimensional vector X:
def gradient(fun, X, b, lambs, h):
    res = []
    for i in range(len(X)):
        t1 = list(X)
        t1[i] = t1[i] + 0.5*h
        t2 = list(X)
        t2[i] = t2[i] - 0.5*h
        res = res + [(fun(t1, b, lambs) - fun(t2, b, lambs)) / h]
    return res
Forgive the naivety of the code, I barely know how to write Python :-)
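If X is a numpy array rather than a list, the same central-difference idea can be written more compactly (a rough sketch; numeric_gradient and the default step h are illustrative choices, not tested against the original problem):

import numpy as np

def numeric_gradient(fun, X, b, lambs, h=1e-3):
    # Central-difference estimate of the gradient of fun w.r.t. X,
    # perturbing one coordinate at a time.
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for i in range(X.size):
        step = np.zeros_like(X)
        step.flat[i] = 0.5 * h
        grad.flat[i] = (fun(X + step, b, lambs) - fun(X - step, b, lambs)) / h
    return grad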