Custom Keras loss to minimize count of elements above a given threshold - tensorflow

I am trying to create a custom loss function for a regression problem that minimizes the number of elements that fall above a certain threshold. My code for this is:
import tensorflow as tf

epsilon = 0.000001

def custom_loss(actual, predicted):  # loss
    actual = actual * 12
    predicted = predicted * 12
    # outputs a value between 1 and 20
    vector = tf.sqrt(2 * (tf.square(predicted - actual + epsilon)) / (predicted + actual + epsilon))
    # Count number of elements above threshold value of 5
    fail_count = tf.cast(tf.size(vector[vector > 5]), tf.float32)
    return fail_count
However, I run into the following error:
ValueError: No gradients provided for any variable: ...
How do I solve this problem?

I don't think you can use this loss function, because the loss does not vary smoothly as the model parameters vary - it jumps from one value to another as the parameters pass a threshold point. So TensorFlow can't calculate gradients, and so can't train the model.
It's the same reason that 'number of images incorrectly classified' isn't used as a loss function, and categorical cross-entropy, which does vary smoothly as parameters change, is used instead.
You may need to find a smoothly varying function that approximates what you want.
[Added after your response below...]
This might do it. It becomes closer to your function as temperature is reduced. But it may not have good training dynamics, and there could be better solutions out there. One approach might be to start training with relatively large temperature, and reduce it as training progresses.
temperature = 1.0
fail_count=tf.reduce_sum(tf.math.sigmoid((vector-5.)/temperature))
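Putting those two pieces together, here is a minimal sketch of what a complete Keras-compatible surrogate loss could look like (the scaling by 12, the error formula and the threshold of 5 are taken from the question; the temperature value and the compile line are illustrative):

import tensorflow as tf

epsilon = 0.000001

def soft_fail_count_loss(temperature=1.0, threshold=5.0):
    def loss(actual, predicted):
        actual = actual * 12
        predicted = predicted * 12
        vector = tf.sqrt(2 * tf.square(predicted - actual + epsilon) / (predicted + actual + epsilon))
        # The sigmoid approaches a hard step as temperature -> 0, so this sum
        # approaches the hard count of elements above the threshold.
        return tf.reduce_sum(tf.math.sigmoid((vector - threshold) / temperature))
    return loss

# model.compile(optimizer='adam', loss=soft_fail_count_loss(temperature=1.0))

Lowering the temperature between training runs (or via a callback) would then implement the annealing idea mentioned above.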

Related

GPflow multi-class: How to squeeze many gp.predict_f_samples to obtain their probabilities?

I classify MNIST digits and I want to sample the probabilities (not the latent function) for each class multiple times. However, gp.predict_y gives the probabilities for just one case.
Thus I take f_samples = gp.predict_f_samples, which returns numerous samples from the underlying latent functions.
Now, how do I 'squeeze' the f_samples through the robust_max likelihood?
Code for my gp:
kernel = gpflow.kernels.Matern52(input_dim=128, ARD=ARD, active_dims=np.arange(128)) \
         + gpflow.kernels.White(input_dim=128, active_dims=np.arange(128))

# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(10)  # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(10, invlink=invlink)  # Multiclass likelihood

Z = x_train[::5].copy()  # inducing inputs

gp = gpflow.models.SVGP(x_train, y_train, num_latent=10,
                        kern=kernel, Z=Z, likelihood=likelihood,
                        whiten=True, q_diag=True)
GPflow version: 1.5.1
Once you've sampled you're no longer working with a probability distribution - you have actual values for each of your 10 latent functions. To convert a sample to probabilities over the classes you can just apply the RobustMax function (probability 1-epsilon for the largest latent function, epsilon/9 for all the others) to the 10 values you get. E.g.
import numpy as np

eps = 0.001
f_samples = gp.predict_f_samples(x_test, num_samples)
largests = np.argmax(f_samples, axis=2)
prob_samples = np.eye(10)[largests] * (1 - eps - eps / 9) + eps / 9
Note that the probabilities you get will all be 1 - eps = 0.999 on one class and eps/9 (roughly 0.0001) on all the others - that's what RobustMax is. If you're intending to average over your samples, you probably just want to call gp.predict_y(), which actually integrates the RobustMax over the probability distribution and can give you some smoother class probabilities if the latent means are close.
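For completeness, a small sketch (assuming f_samples, and hence prob_samples above, comes back with shape (num_samples, N, 10)) contrasting the Monte Carlo average of the sampled RobustMax probabilities with the analytic gp.predict_y():

# Monte Carlo average of the per-sample RobustMax probabilities from above
mc_probs = prob_samples.mean(axis=0)          # (N, 10)

# ...versus letting GPflow integrate RobustMax over the latent distribution
y_probs, _ = gp.predict_y(x_test)             # (N, 10)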

Higher loss penalty for true non-zero predictions

I am building a deep regression network (CNN) to predict a (1000,1) target vector from images (7,11). The target usually consists of about 90 % zeros and only 10 % non-zero values. The distribution of (non-) zero values in the targets vary from sample to sample (i.e. there is no global class imbalance).
Using mean squared error loss, this led to the network predicting only zeros, which I don't find surprising.
My best guess is to write a custom loss function that penalizes errors regarding non-zero values more than the prediction of zero-values.
I have tried the loss function below with the intent of implementing what I guessed could work above. It is a mean squared error loss in which errors on zero targets are penalized less (w=0.1).
import tensorflow as tf
from tensorflow.keras import backend as K

def my_loss(y_true, y_pred):
    # weight errors on zero targets less than errors on non-zero targets
    w = 0.1
    y_pred_of_nonzeros = tf.where(tf.equal(y_true, 0), y_pred - y_pred, y_pred)
    return K.mean(K.square(y_true - y_pred_of_nonzeros)) + K.mean(K.square(y_true - y_pred)) * w
The network is able to learn without getting stuck with only-zero predictions. However, this solution seems quite unclean. Is there a better way to deal with this type of problem? Any advice on improving the custom loss function?
Any suggestions are welcome, thank you in advance!
Best,
Lukas
Not sure there is anything better than a custom loss like the one you wrote, but there is a cleaner way:
def weightedLoss(w):
    def loss(true, pred):
        error = K.square(true - pred)
        error = K.switch(K.equal(true, 0), w * error, error)
        return error
    return loss
You may also return K.mean(error), but without the mean you can still benefit from other Keras options, such as sample weights.
Select the weight when compiling:
model.compile(loss = weightedLoss(0.1), ...)
If you have the entire data in an array, you can do:
w = K.mean(y_train)
w = w / (1 - w)  # this line compensates for the missing ~90% weight on class 1
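As a usage sketch (assuming y_train is the array of regression targets from the question, with roughly 10% non-zero entries, and model is the CNN being compiled; the fit arguments are illustrative), the weight could be estimated and plugged in like this:

import numpy as np

frac_nonzero = np.count_nonzero(y_train) / y_train.size   # ~0.1 in the question's setting
w = frac_nonzero / (1.0 - frac_nonzero)                    # ~0.11

model.compile(optimizer='adam', loss=weightedLoss(w))
model.fit(x_train, y_train, epochs=10, batch_size=32)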
Another solution, which avoids a custom loss but requires changes to the data and the model, is:
Transform your y into a 2-class problem for each output. Shape = (batch, originalClasses, 2).
For the zero values, make the first of the two classes = 1
For the one values, make the second of the two classes = 1
newY = np.stack([1-oldY, oldY], axis=-1)
Adjust the model to output this new shape.
...
model.add(Dense(2*classes))
model.add(Reshape((classes,2)))
model.add(Activation('softmax'))
Make sure you are using a softmax and a categorical_crossentropy as loss.
Then use the argument class_weight={0: w, 1: 1} in fit.
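A minimal sketch of that two-class reformulation (the Flatten front end is only a placeholder for the real CNN layers, classes = 1000 follows the (1000, 1) target from the question, and treating any non-zero target as the positive class is my assumption):

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Reshape, Activation

classes = 1000

# Two-class encoding per output: column 0 = "is zero", column 1 = "is non-zero"
oldY = (y_train != 0).astype('float32')       # (N, classes)
newY = np.stack([1 - oldY, oldY], axis=-1)    # (N, classes, 2)

model = Sequential([
    Flatten(input_shape=(7, 11)),             # placeholder for the original CNN layers
    Dense(2 * classes),
    Reshape((classes, 2)),
    Activation('softmax'),                    # softmax over the last axis (the 2 classes)
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(x_train, newY, class_weight={0: w, 1: 1})  # class_weight as suggested above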

How to make a selective back-propagation in a mini-batch in Tensorflow?

Recently, I have been working on a project: "predicting future trajectories of objects from their past trajectories by using LSTMs in TensorFlow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The size of the mini-batch is fixed during training. However, the number of past trajectories in a mini-batch can differ. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, 6 of the 10 slots in the mini-batch are padded with zeros.
When calculating the loss for the back-propagation, I set the loss from those 6 to zero so that only the 4 real trajectories contribute to the back-propagation.
The problem I am concerned about is that TensorFlow still seems to calculate gradients for the 6 padded entries even though their loss is zero. As a result, training becomes slower as I increase the mini-batch size, even though I use the same training data.
I also used the tf.where function when calculating the loss. However, the training time did not decrease.
How can I reduce the training time?
Here I attached my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
    # For each element in a batch
    for b in range(batch_size):
        with tf.variable_scope("rnnlm") as scope:
            if (f > 0 or b > 0):
                scope.reuse_variables()

        # for each pedestrian in an element
        for p in range(MNP):
            # ground-truth position
            cur_gt_pose = ...
            # loss mask
            loss_mask_ped = ...  # '1' or '0'

            # go through RNN decoder
            output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
                                                                                zero_states_dec_list[b][p])
            # fully connected layer for output
            cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
            # go through embedding function for the next input
            prev_embed_frms_dec_list[b][p] = tf.reshape(
                tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)),
                shape=(1, rnn_size))
            # calculate MSE loss
            mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
            # only a valid pedestrian's trajectory contributes to the loss
            self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
However, it looks like this is probably not your issue. It seems that for each item in your dataset you are defining new graph nodes. This is not how TensorFlow is meant to be used: you should have only one graph, built beforehand, that performs some fixed function regardless of the batch size. You should definitely not be defining new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
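To illustrate that second point, here is a minimal sketch of a single, batch-level masked loss that is built once instead of per element in Python loops (the shapes and names are illustrative, not taken from the question):

import tensorflow as tf

def masked_mse_loss(pred, gt, loss_mask):
    # pred, gt: (batch_size, pred_length, 2) predicted and ground-truth trajectories
    # loss_mask: (batch_size,) with 1.0 for real trajectories and 0.0 for zero-padded ones
    per_traj_loss = tf.reduce_sum(tf.square(pred - gt), axis=[1, 2])   # (batch_size,)
    num_valid = tf.maximum(tf.reduce_sum(loss_mask), 1.0)
    # Padded entries contribute nothing, and no per-element graph nodes are created
    return tf.reduce_sum(per_traj_loss * loss_mask) / num_valid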

How to implement customized loss function and perform gradient back-propagation in tensorflow

I need to implement YOLOv2 based on the TensorFlow framework.
Firstly, in my network design, there are five anchors for each cell and one class (face), thus finally the network outputs 4D tensor that has n * c * h * w shape. Here n represents the batch size, c = 5 * (location coordinates + objectiveness score + classification probability) = 5 * (4 + 1 + 1) = 30, and h/w represent height and width of feature map respectively.
Secondly, YOLOv2 adopts a multi-task loss function.
So I defined the following function to calculate the total loss:
def yolov2_loss_function(pred, ground_truth, global_step):
This function accepts three parameters: pred represents the network output tensor described above, ground_truth represents the corresponding GT, and global_step represents the number of iterations. The function returns a scalar value denoting the total loss.
Finally I use the following code to perform SGD train:
......
total_loss = yolov2_loss_function(pred, gt, global_step)
optimizer = tf.train.MomentumOptimizer(learning_rate=lr, momentum=momentum).minimize(total_loss, global_step=global_step)
......
I am not sure whether the above process is correct. In particular, total_loss is just a scalar, so how does the TensorFlow framework know the residual/gradient of each element in the output tensor and then perform backward propagation? I know the mechanism of automatic differentiation, but I thought its premise is that each output element has its own residual.
Although in the function yolov2_loss_function I first calculate each element's residual and then sum them into the total loss, how does the TensorFlow framework know the residual of each output element?
Thank you very much.
I think you mixed it up.
For any differentiable function $f\colon \mathbb{R}^n \to \mathbb{R},\ x \mapsto f(x)$, the partial derivatives $\frac{\partial}{\partial x_i} f(x)$ exist.
Hence, even for a scalar-valued function the gradient can be written as a vector, as long as the function maps a vector to a scalar.
For $f(a, b) = a \cdot a \cdot b$, the derivative w.r.t. $a$ is $2ab$ and the derivative w.r.t. $b$ is $a^2$. No issue here.

Tensorflow num_classes parameter of nce_loss()

My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative samples) and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative-sample words (so, if I am correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes parameter to the nce_loss function? My best guess is that the number of classes is passed in so that TensorFlow knows the size of the distribution from which the negative samples are drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason why we would need to know the total possible number of classes, especially if the language model only outputs k + 1 predictions (negative sample size + 1 for the target word).
Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.
Here's the relevant part of the source code:
# Sample the negative labels.
#   sampled shape: [num_sampled] tensor
#   true_expected_count shape = [batch_size, 1] tensor
#   sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
    sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
        true_classes=labels,
        num_true=num_true,
        num_sampled=num_sampled,
        unique=True,
        range_max=num_classes)
The range_max=num_classes argument basically defines the shape of this distribution and also the range of the sampled values - [0, range_max). Note that this range can't be accurately inferred from the labels, because a particular mini-batch can have only small word ids, which would skew the distribution significantly.
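Here is a minimal sketch of a typical call (the sizes are made up) showing that num_classes is simply the vocabulary size, handed through to the sampler as range_max:

import tensorflow as tf

vocab_size = 50000      # num_classes: total number of possible words
embedding_dim = 128
num_sampled = 64        # number of negative samples
batch_size = 32

nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embedding_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# inputs: embedded context vectors; labels: target word ids of shape [batch_size, 1]
inputs = tf.random.normal([batch_size, embedding_dim])
labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=labels,
                   inputs=inputs,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))   # becomes range_max for the log-uniform sampler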