Calculate prediction derivative in own loss function - tensorflow

In addition to the MSE between y_true and y_predicted, I would like to use the second derivative of y_predicted in the cost function, because my model's output is currently very dynamic. Suppose I have y_predicted with shape (256, 100, 1). The first dimension corresponds to the samples (delta_t between each sample is 0.1s). Now I would like to differentiate along the first dimension, i.e.
diff(diff(y_predicted[1, :, 1]))/delta_t**2
for each row (0-dim) in y_predicted.
Note: I only want to use y_predicted and delta_t to differentiate.
Thank you very much,
Max

To calculate the second-order derivative you could use tf.hessians as follows:
x = tf.Variable([7])
x2 = x * x
d2x2 = tf.hessians(x2, x)
Evaluating d2x2 yields:
[array([[2]], dtype=int32)]
In your case, you could do
loss += lam_l1 * tf.hessians(y_pred, xs)
where xs are the tensors with respect to which you would like to differentiate.
If you wish to use Keras directly, you can chain keras.backend.gradients(loss, variables) twice; there is no Keras equivalent of tf.hessians.
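
Note that tf.hessians differentiates with respect to input tensors, whereas the question asks for a finite difference along the time axis of the prediction itself. Below is a minimal sketch of such a penalty, written in NumPy so that it runs standalone. It assumes the time axis is axis 1 (the one of length 100), and the names second_difference, smoothness_penalty and lam are made up for illustration. The same slicing should carry over to TensorFlow tensors, with tf.reduce_mean and tf.square in place of the NumPy calls.

```python
import numpy as np

def second_difference(y, delta_t):
    """Second finite difference along axis 1 (assumed to be the time axis)."""
    # y[:, 2:] - 2*y[:, 1:-1] + y[:, :-2] is diff(diff(y)) along axis 1
    return (y[:, 2:] - 2.0 * y[:, 1:-1] + y[:, :-2]) / delta_t ** 2

def smoothness_penalty(y_true, y_pred, delta_t, lam):
    """MSE plus lam times the mean squared second derivative of y_pred."""
    mse = np.mean((y_pred - y_true) ** 2)
    curvature = np.mean(second_difference(y_pred, delta_t) ** 2)
    return mse + lam * curvature

# Check on y = t**2, whose second derivative is 2 everywhere
delta_t = 0.1
t = np.arange(100) * delta_t
y_pred = np.tile((t ** 2)[None, :, None], (256, 1, 1))  # shape (256, 100, 1)
d2 = second_difference(y_pred, delta_t)                  # shape (256, 98, 1)
```

Only y_pred and delta_t enter the derivative term, as the question requires. Two time steps are lost at the boundary, so the penalty is averaged over 98 positions per row.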

How to: TensorFlow-Probability custom loss that ignores NA values (or otherwise masks loss)

I seek to implement in TensorFlow-Probability a masked loss function, that can ignore NAs in the labels.
This is a well-worn task for regular tensors, but I cannot find an example for distributions.
My distributions are sized (batch, time-steps, outputs): (512, 251 days, 1 to 8 time series).
The traditional loss function given in examples is the following, which uses the distribution's log probability:
neg_log_likelihood <- function (x, rv_x) {
    -1*(rv_x %>% tfd_log_prob(x))
}
When I replace NAs with zeros, the model trains fine and converges. When I leave in NAs it produces NaN losses as expected.
I've experimented with many different permutations of tf$where to replace loss with 0, the label with 0, etc. In each of those cases the model stops training and loss stays near some constant. That's the case even when there's just a single NA in the labels.
neg_log_likelihood_missing <- function (x, rv_x) {
    loss = -1*( rv_x %>% tfd_log_prob(x) )
    loss_nonan = tf$where( tf$math$is_finite(x) , loss, 0 )
    return(
        loss_nonan
    )
}
My use of R here is incidental, and any examples in Python or otherwise I can translate. If there's a correct way to do this so that losses back-propagate correctly, I would greatly appreciate it.
If you are using gradient-based inference, you may need the "double where" trick.
While this gets you a correct value of y:
y = computation(x)
tf.where(is_nan(y), 0, y)
...the derivative of the tf.where can still have a nan.
Instead write:
safe_x = tf.where(is_unsafe(x), some_safe_x, x)
y = computation(safe_x)
tf.where(is_unsafe(x), 0, y)
...to get both a safe y out and a safe dy/dx.
For the case you're considering, perhaps write:
class MyMaskedDist(tfd.Distribution):
    ...
    def _log_prob(self, x):
        safe_x = tf.where(tf.is_nan(x), self.mode(), x)
        lp = compute_log_prob(safe_x)
        lp = tf.where(tf.is_nan(x), tf.zeros([], lp.dtype), lp)
        return lp
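
To see why the single where is not enough, it may help to write the chain rule out by hand. The sketch below does that in NumPy for y = sqrt(x) at x = 0, where the slope is infinite. It only imitates what reverse-mode autodiff computes and is not TensorFlow code; the variable names are made up for illustration.

```python
import numpy as np

x = np.array([4.0, 0.0])   # sqrt has an infinite slope at x = 0
unsafe = x <= 0.0

# d(where(unsafe, 0, y))/dy: 0 where the constant branch is selected, 1 otherwise
mask_grad = np.where(unsafe, 0.0, 1.0)

with np.errstate(divide="ignore", invalid="ignore"):
    # Single where: only the output is masked
    dy_dx = 0.5 / np.sqrt(x)          # [0.25, inf]
    naive_grad = mask_grad * dy_dx    # second entry is 0 * inf = nan

    # Double where: the input is made safe before the computation
    safe_x = np.where(unsafe, 1.0, x)
    dy_dsafe = 0.5 / np.sqrt(safe_x)  # [0.25, 0.5], finite everywhere
    # chain rule: d(out)/dy * dy/d(safe_x) * d(safe_x)/dx
    safe_grad = mask_grad * dy_dsafe * mask_grad
```

The masked-out entry gets gradient nan in the first version and exactly 0 in the second, because the computation never sees the unsafe input. A NaN label inside a log-probability behaves the same way: multiplying a NaN gradient by a zero mask still gives NaN.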

Is this Neural Net example I'm looking at a mistake or am I not understanding backprop?

Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below), backprop calculates the gradient for the last layer's weights w2 by matrix-multiplying (y_pred - y) with h_relu. I thought h_relu sat only between layers w1 and w2, not between w2 and y_pred.
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
grad_w2 = h_relu.t().mm(grad_y_pred)
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another relu, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach, the gradients of L w.r.t. w1 and w2 are
dL/dw1 = dL/dy * dy/dh_relu * dh_relu/dh * dh/dw1
and
dL/dw2 = dL/dy * dy/dw2
Something to notice is that since w2 is a leaf tensor, we only use dy/dw2 (aka grad_w2) during computation of dL/dw2 since it isn't part of the path from L to w1.
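
Since the question asks what happens with another layer of weights and another relu in the middle, here is a minimal NumPy sketch of that case. The extra layer, the sizes and all names beyond those in the original example are assumptions for illustration. The backward pass walks from the loss to each weight one chain-rule step at a time, and a finite-difference check on one entry of w2 confirms the hand-written gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H1, H2, D_out = 8, 5, 4, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H1))
w2 = rng.standard_normal((H1, H2))    # hypothetical extra middle layer
w3 = rng.standard_normal((H2, D_out))

def forward(w1, w2, w3):
    h1 = x @ w1
    h1_relu = np.maximum(h1, 0)
    h2 = h1_relu @ w2
    h2_relu = np.maximum(h2, 0)
    y_pred = h2_relu @ w3
    loss = np.sum((y_pred - y) ** 2)
    return loss, (h1, h1_relu, h2, h2_relu, y_pred)

loss, (h1, h1_relu, h2, h2_relu, y_pred) = forward(w1, w2, w3)

# Backward pass: each weight gradient uses the *input* to that layer
# and the gradient flowing back from that layer's *output*.
grad_y_pred = 2.0 * (y_pred - y)
grad_w3 = h2_relu.T @ grad_y_pred
grad_h2_relu = grad_y_pred @ w3.T
grad_h2 = grad_h2_relu * (h2 > 0)     # relu passes gradient only where h2 > 0
grad_w2 = h1_relu.T @ grad_h2
grad_h1_relu = grad_h2 @ w2.T
grad_h1 = grad_h1_relu * (h1 > 0)
grad_w1 = x.T @ grad_h1

# Finite-difference check on a single entry of w2
eps = 1e-6
w2_plus = w2.copy()
w2_plus[0, 0] += eps
w2_minus = w2.copy()
w2_minus[0, 0] -= eps
numeric = (forward(w1, w2_plus, w3)[0] - forward(w1, w2_minus, w3)[0]) / (2 * eps)
```

Each relu output is used once in the forward pass (as the input to the next layer) and once in the backward pass (to compute the gradient of that next layer's weights). That is the same pattern as h_relu appearing in grad_w2 in the two-layer example.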

Accessing elements of a placeholder in tensorflow [duplicate]

I have a neural network with MSE loss function being implemented something like this:
# input x_ph is of size Nx1 and output should also be of size Nx1
def train_neural_network_batch(x_ph, predict=False):
    prediction = neural_network_model(x_ph)
    # MSE loss function
    cost = tf.reduce_mean(tf.square(prediction - y_ph))
    optimizer = tf.train.AdamOptimizer(learn_rate).minimize(cost)
    # mini-batch optimization here
I'm fairly new to neural networks and Python, but I understand that each iteration, a sample of training points will be fed into the neural network and the loss function evaluated at the points in this sample. However, I would like to be able to modify the loss function so that it weights certain data more heavily. Some pseudocode of what I mean
# manually compute the MSE of the data without the first sampled element
cost = 0.0
for ii in range(1,len(y_ph)):
    cost += tf.square(prediction[ii] - y_ph[ii])
cost = cost/(len(y_ph)-1.0)
# weight the first sampled data point more heavily according to some parameter W
cost += W*(prediction[0] - y_ph[0])
I might have more points I wish to weight differently as well, but for now, I'm just wondering how I can implement something like this in tensorflow. I know len(y_ph) is invalid as y_ph is just a placeholder, and I can't just do something like y_ph[i] or prediction[i].
You can do this in multiple ways:
1) If some of your data instances simply weigh 2 or 3 times more than a normal instance, you can copy those instances multiple times in your data set. They then contribute more to the loss, which satisfies your intention. This is the simplest way.
2) If your weighting is more complex, say a float weight per instance, you can define a placeholder for the weights, multiply it with the per-instance loss, and use feed_dict to feed the weights into the session together with the x batch and y batch. Just make sure instance_weight has the same size as the batch.
For example:
import tensorflow as tf
import numpy as np
with tf.variable_scope("test", reuse=tf.AUTO_REUSE):
    x = tf.placeholder(tf.float32, [None,1])
    y = tf.placeholder(tf.float32, [None,1])
    instance_weight = tf.placeholder(tf.float32, [None,1])
    w1 = tf.get_variable("w1", shape=[1, 1])
    prediction = tf.matmul(x, w1)
    cost = tf.square(prediction - y)
    loss = tf.reduce_mean(instance_weight * cost)
    opt = tf.train.AdamOptimizer(0.5).minimize(loss)
with tf.Session() as sess:
    x1 = [[1.],[2.],[3.]]
    y1 = [[2.],[4.],[3.]]
    instance_weight1 = [[10.0], [10.0], [0.1]]
    sess.run(tf.global_variables_initializer())
    print (x1)
    print (y1)
    print (instance_weight1)
    for i in range(1000):
        _, loss1, prediction1 = sess.run([opt, loss, prediction], feed_dict={instance_weight : instance_weight1, x : x1, y : y1 })
        if (i % 100) == 0:
            print(loss1)
            print(prediction1)
Note instance_weight1: you may change it to see the difference (here the batch size is 3).
The first two data points follow the rule y = 2*x,
whereas the third follows the rule y = x.
With the weights [10, 10, 0.1], the prediction converges to the rule of the first two points and almost ignores the third. The output is:
[[1.9823183]
[3.9646366]
[5.9469547]]
PS: in a TensorFlow graph, it is highly recommended to avoid for loops and use matrix operations instead, so that the computation can run in parallel.
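
As a sanity check on the numbers above: for this one-parameter linear model, the weighted MSE can be minimised in closed form, so the converged prediction can be reproduced without a training loop. This is a NumPy sketch of that calculation, not part of the TensorFlow answer; weighted_mse and w_opt are names chosen for illustration.

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [3.0]])
instance_weight = np.array([[10.0], [10.0], [0.1]])

def weighted_mse(w):
    """Same loss as in the graph above: mean(instance_weight * (x*w - y)**2)."""
    prediction = x * w
    return np.mean(instance_weight * (prediction - y) ** 2)

# Setting d(loss)/dw = 0 gives w = sum(weight*x*y) / sum(weight*x*x)
w_opt = np.sum(instance_weight * x * y) / np.sum(instance_weight * x * x)
prediction = x * w_opt
```

Here w_opt is 100.9 / 50.9, roughly 1.9823, which agrees with the printed predictions. The whole loss is a single vectorised expression with no loop over instances.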

Tensorflow: Loss function which takes one-hot as argument

My LSTM RNN has to predict a single letter (Y), given the preceding text (X).
For example, if "Oh, say! can you see by the dawn's early ligh" is given as X, then Y would be "t" (part of the national anthem). Each letter is one-hot encoded, so g, for example, is [0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0].
dataX:[batch_size,20,num_of_classes], dataY:[batch_size,1,num_of_classes]
In this case, what loss function would be best for prediction?
Both X and Y are one-hot encoded; X is a sequence of many letters and Y is a single letter.
I rarely find loss functions that take a one-hot vector as a parameter (such as a parameter for logits or targets).
What you are looking for is the cross entropy between
Y_ (ground truth) and Y (probabilities)
You could use a basic hand coded cross entropy like
y = tf.nn.softmax( logit_layer )
loss = tf.reduce_mean(-tf.reduce_sum( y_ * tf.log(y), axis=-1 ))
Or you could use the built-in TensorFlow function, which is numerically more stable and returns one loss value per example:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( labels=y_, logits=logit_layer))
Your Y output would be something like [0.01,0.02,0.01,.98,0.02,...] and your logit_layer is just the raw output before applying softmax.
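
For reference, the calculation behind a softmax cross entropy can be sketched in a few lines of NumPy. This is an illustration of the formula, not TensorFlow's implementation; the function name and the sample values are made up.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Per-example cross entropy between one-hot labels and raw logits."""
    # Subtracting the row maximum keeps exp() from overflowing
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Sum over the class axis: only the true class contributes
    return -(labels * log_probs).sum(axis=-1)

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one-hot targets
per_example = softmax_cross_entropy(labels, logits)
loss = per_example.mean()
```

Because the labels are one-hot, the sum over classes just picks out the negative log-probability of the correct letter. The class axis is the last one, so the same function works for targets shaped [batch_size, 1, num_of_classes].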

Efficient way in TensorFlow to subtract mean and divide by standard deviation for each row

I have a tensor of shape [x, y] and I want to subtract the mean and divide by the standard deviation row-wise (i.e. I want to do it for each row). What is the most efficient way to do this in TensorFlow?
Of course I can loop through rows as follows:
new_tensor = [i - tf.reduce_mean(i) for i in old_tensor]
...to subtract the mean and then do something similar to find the standard deviation and divide by it, but is this the best way to do it in TensorFlow?
The TensorFlow tf.sub() and tf.div() operators support broadcasting, so you don't need to iterate through every row. Let's consider the mean, and leave standard deviation as an exercise:
old_tensor = ... # shape = (x, y)
mean = tf.reduce_mean(old_tensor, 1, keep_dims=True) # shape = (x, 1)
stdev = ... # shape = (x,)
stdev = tf.expand_dims(stdev, 1) # shape = (x, 1)
new_tensor = old_tensor - mean # shape = (x, y)
new_tensor = new_tensor / stdev # shape = (x, y)
The subtraction and division operators implicitly broadcast a tensor of shape (x, 1) along the column dimension to match the shape of the other argument, (x, y). For more details about how broadcasting works, see the NumPy documentation on the topic (TensorFlow implements NumPy broadcasting semantics).
Alternatively, calculate the moments along axis 1 (y in your case) and keep the dimensions, so that mean and var both have shape (x, 1).
Then subtract the mean and divide by the standard deviation (i.e. the square root of the variance):
mean, var = tf.nn.moments(old_tensor, [1], keep_dims=True)
new_tensor = tf.div(tf.subtract(old_tensor, mean), tf.sqrt(var))
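
Since TensorFlow follows NumPy broadcasting semantics, the shapes can be checked with a small NumPy sketch. The sample values are made up; np.std computes the population standard deviation, which matches the square root of the variance returned by tf.nn.moments.

```python
import numpy as np

old_tensor = np.array([[1.0, 2.0, 3.0],
                       [10.0, 20.0, 60.0]])            # shape (2, 3)

mean = old_tensor.mean(axis=1, keepdims=True)          # shape (2, 1)
stdev = old_tensor.std(axis=1, keepdims=True)          # shape (2, 1)

# (2, 3) op (2, 1) broadcasts along the column dimension
new_tensor = (old_tensor - mean) / stdev               # shape (2, 3)
```

Each row of new_tensor then has mean 0 and standard deviation 1. A row with zero variance would divide by zero, so a small epsilon in the denominator may be needed in practice.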