Higher loss penalty for true non-zero predictions - tensorflow

I am building a deep regression network (CNN) to predict a (1000, 1) target vector from (7, 11) images. The target usually consists of about 90% zeros and only 10% non-zero values. The distribution of (non-)zero values in the targets varies from sample to sample (i.e. there is no global class imbalance).
Using mean squared error loss, this led to the network predicting only zeros, which I don't find surprising.
My best guess is to write a custom loss function that penalizes errors on non-zero values more than errors on zero values.
I have tried the loss function below with the intent of implementing that guess. It is a mean squared error loss in which errors on zero targets are penalized less (w = 0.1).
import tensorflow as tf
from tensorflow.keras import backend as K

def my_loss(y_true, y_pred):
    # errors on zero targets get weight w; errors on non-zero targets get weight 1 + w
    w = 0.1
    # zero out the predictions wherever the target is zero
    y_pred_of_nonzeros = tf.where(tf.equal(y_true, 0), tf.zeros_like(y_pred), y_pred)
    return K.mean(K.square(y_true - y_pred_of_nonzeros)) + K.mean(K.square(y_true - y_pred)) * w
The network is able to learn without getting stuck with only-zero predictions. However, this solution seems quite unclean. Is there a better way to deal with this type of problem? Any advice on improving the custom loss function?
Any suggestions are welcome, thank you in advance!
Best,
Lukas

Not sure there is anything better than a custom loss just like you did, but there is a cleaner way:
def weightedLoss(w):
    def loss(true, pred):
        error = K.square(true - pred)
        error = K.switch(K.equal(true, 0), w * error, error)
        return error
    return loss
You may also return K.mean(error), but without the mean you can still benefit from other Keras options, such as sample weights.
Select the weight when compiling:
model.compile(loss = weightedLoss(0.1), ...)
If you have the entire data in an array, you can do:
w = K.mean(y_train)
w = w / (1 - w)  # this line compensates for the lack of the 90% weights for class 1
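For instance, computing the weight from the target array up front and plugging it in (a minimal sketch; y_train is assumed to be a 0/1 NumPy target array as the snippet above implies, and x_train is a placeholder name for the inputs):

import numpy as np

w = float(np.mean(y_train))   # fraction of non-zero entries, roughly 0.1 here
w = w / (1.0 - w)             # ~0.11: errors on the zero targets get this smaller weight
model.compile(loss=weightedLoss(w), optimizer='adam')
model.fit(x_train, y_train, epochs=10, batch_size=32)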
Another solution that avoids a custom loss, but requires changes to the data and the model, is:
Transform your y into a 2-class problem for each output. Shape = (batch, originalClasses, 2).
For the zero values, make the first of the two classes = 1
For the one values, make the second of the two classes = 1
newY = np.stack([1-oldY, oldY], axis=-1)
Adjust the model to output this new shape.
...
model.add(Dense(2*classes))
model.add(Reshape((classes,2)))
model.add(Activation('softmax'))
Make sure you are using a softmax and a categorical_crossentropy as loss.
Then use the argument class_weight={0: w, 1: 1} in fit.

Related

Custom Keras loss to minimize count of elements above a given threshold

I am trying to create a custom loss function for a regression problem that would minimize the number of elements that fall above a certain threshold. My code for this is:
import tensorflow as tf

epsilon = 0.000001

def custom_loss(actual, predicted):  # loss
    actual = actual * 12
    predicted = predicted * 12
    # outputs a value between 1 and 20
    vector = tf.sqrt(2 * (tf.square(predicted - actual + epsilon)) / (predicted + actual + epsilon))
    # Count number of elements above threshold value of 5
    fail_count = tf.cast(tf.size(vector[vector > 5]), tf.float32)
    return fail_count
However, I run into the following error:
ValueError: No gradients provided for any variable: ...
How do I solve this problem?
I don't think you can use this loss function, because the loss does not vary smoothly as the model parameters vary - it will jump from one value to another different value as parameters pass a threshold point. So tensorflow can't calculate gradients, and so can't train the model.
It's the same reason that 'number of images incorrectly classified' isn't used as a loss function, and categorical cross-entropy, which does vary smoothly as parameters change, is used instead.
You may need to find a smoothly varying function that approximates what you want.
[Added after your response below...]
This might do it. It becomes closer to your function as temperature is reduced. But it may not have good training dynamics, and there could be better solutions out there. One approach might be to start training with relatively large temperature, and reduce it as training progresses.
temperature = 1.0
fail_count = tf.reduce_sum(tf.math.sigmoid((vector - 5.) / temperature))
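Putting that together with the metric from the question gives a loss along these lines (a sketch only, reusing the question's epsilon and scaling; temperature and threshold are tunable assumptions):

import tensorflow as tf

epsilon = 0.000001

def soft_fail_count_loss(actual, predicted, temperature=1.0, threshold=5.0):
    actual = actual * 12
    predicted = predicted * 12
    vector = tf.sqrt(2 * tf.square(predicted - actual + epsilon) / (predicted + actual + epsilon))
    # sigmoid is ~0 well below the threshold and ~1 well above it, so the sum is a
    # differentiable approximation of the count of elements above the threshold
    return tf.reduce_sum(tf.math.sigmoid((vector - threshold) / temperature))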

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                     cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1, 1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# compile model
model.compile(
    loss=loss_mse,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=[])
In loss_mse (mean squared error), I receive a (32, 4) tensor for y_pred. One row could look like:
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells whether this combination is 0 or 1. Finally, the regular (y_true - y_pred)^2 can be computed for that row, followed by an np.sum over all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0, 0, 0, 0]: return 0
    ...
    elif measurement == [1, 1, 1, 1]: return 0
    else: return -1

def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros((len(y_pred)))  # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and look up
        y_labelled = get_label([0 if y < 0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred])**2
    # reduce
    loss = np.sum(loss) / len(y_true)
    return loss

@tf.function
def loss_mse(y_true, y_pred):
    external_list = []
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system appears to still expect a (32, 4) tensor, whereas I would have thought I could simply provide a single loss value (a float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. Such a function lets you apply a reduction operation across a given tensor axis, allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
@tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)

    # Scale from [-1, 1] to [0, 1]
    y_true += 1
    y_true /= 2
    y_pred += 1
    y_pred /= 2

    # These are now both (32,) with the reduction of taking the mean applied along
    # the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)

    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
Another thing I will mention: if you want JUST the sum of these Pauli operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])], and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.
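As a concrete illustration of the summed-operator idea (a sketch only; model_circuit and my_operator are the objects defined above, and tfq is tensorflow_quantum):

import tensorflow as tf
import tensorflow_quantum as tfq

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # a single expectation value per sample: <Z0 + Z1 + Z2 + Z3>, in [-4, 4]
    tfq.layers.PQC(model_circuit, my_operator),
])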

How to optimize multiple loss functions separately in Keras?

I am currently trying to build a deep learning model with three different loss functions in Keras. The first loss function is the typical mean squared error loss. The other two loss functions are ones I built myself, which find the difference between a calculation made from the input image and the output image (this code is a simplified version of what I'm doing).
def p_autoencoder_loss(yTrue, yPred):
    def loss(yTrue, yPred):
        return K.mean(K.square(yTrue - yPred), axis=-1)

    def a(image):
        return K.mean(K.sin(image))

    def b(image):
        return K.sqrt(K.cos(image))

    a_pred = a(yPred)
    a_true = a(yTrue)
    b_pred = b(yPred)
    b_true = b(yTrue)

    empirical_loss = loss(yTrue, yPred)
    a_loss = K.mean(K.square(a_true - a_pred))
    b_loss = K.mean(K.square(b_true - b_pred))
    final_loss = K.mean(empirical_loss + a_loss + b_loss)
    return final_loss
However, when I train with this loss function, it is simply not converging well. What I want to try is to minimize the three loss functions separately, not together by adding them into one loss function.
I essentially want to do the second option here Tensorflow: Multiple loss functions vs Multiple training ops but in Keras form. I also want the loss functions to be independent from each other. Is there a simple way to do this?
You could have 3 outputs in your Keras model, each with your specified loss, and Keras has support for weighting these losses. It will also generate a final combined loss for you in the output, but it will be optimising to reduce all three losses. Be wary with this though: depending on your data/problem/losses, you might find it stalls slightly or is slow if you have losses fighting each other. This approach requires the functional API. I'm unsure whether this actually implements separate optimiser instances, but I think this is as close as you will get in pure Keras, as far as I'm aware, without having to start writing more complex TF training regimes.
For example:
loss_out1 = layers.Dense(1, activation='sigmoid', name='loss1')(x)
loss_out2 = layers.Dense(1, activation='sigmoid', name='loss2')(x)
loss_out3 = layers.Dense(1, activation='sigmoid', name='loss3')(x)

model = keras.Model(inputs=[input],
                    outputs=[loss_out1, loss_out2, loss_out3])
model.compile(optimizer=keras.optimizers.RMSprop(1e-3),
              loss=['binary_crossentropy', 'categorical_crossentropy', 'custom_loss1'],
              loss_weights=[1., 1., 1.])
This should compile a model with 3 outputs at the end, all taken from (x), which would be defined above. When you compile, you set the outputs as a list, and you also set the losses and loss weights as lists. Note that when you fit() you'll need to supply your target outputs three times as a list too, e.g. [y, y, y], as your model now has three outputs.
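For example, a minimal sketch of the corresponding fit call (x_train and y are placeholder names for your inputs and targets):

model.fit(x_train, [y, y, y], epochs=10, batch_size=32)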
I'm not a Keras expert, but it's pretty high-level and I'm not aware of another way using pure Keras. Hopefully someone can correct me with a better solution!
Since there is only one output, a few things can be done:
1. Monitor the individual loss components to see how they vary.

def a_loss(y_true, y_pred):
    a_pred = a(y_pred)
    a_true = a(y_true)
    return K.mean(K.square(a_true - a_pred))

model.compile(...., metrics=[..., a_loss, b_loss])
2. Weight the loss components, where lambda_a and lambda_b are hyperparameters.
final_loss = K.mean(empirical_loss + lambda_a * a_loss + lambda_b * b_loss)
3. Use a different loss function, like SSIM:
https://www.tensorflow.org/api_docs/python/tf/image/ssim
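A sketch of such an SSIM-based loss (assuming y_true and y_pred are batches of images scaled to [0, 1]):

import tensorflow as tf

def ssim_loss(y_true, y_pred):
    # tf.image.ssim returns a similarity in [-1, 1]; 1 - SSIM turns it into a loss
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))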

Tensorflow: my rnn always output same value, weights of rnn are not trained

I used tensorflow to implement a simple RNN model to learn possible trends of time series data and predict future values. However, the model always produces the same values after training. Actually, the best model it got is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
    tf.reset_default_graph()
    X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
    stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
    stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
    stacked_output = tf.layers.dense(stacked_output, n_output)
    return X, stacked_output
In training, n_timesteps=1, n_input=1, n_output=1, n_units=2, learning_rate=0.0000001, and the loss is calculated by mean squared error.
Input is a sequence of data in continuous days. Output is the data after the days of input.
(Maybe these are not good settings. But no matter how I change them, the results are almost same. So I just set these to help show them later.)
And I found out this is because the weights and bias of BasicRNNCell are not trained. They stay the same from the beginning, and only the weights and bias of Dense keep changing. So in training, I got predictions like these:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
(In the omitted plot, the orange line indicates the true data and the blue line the results of my code; through training, the blue line keeps going up until the model reaches a stable loss.)
So I doubted whether I had made a wrong implementation, and generated a group of data with y = 10x + 5 for testing. This time, my model learns the correct results.
(Plots from the beginning and the end of training omitted.)
I have tried:
add more layers of both BasicRNNCell and Dense
increase the RNN cell hidden size (n_units) to 128
decrease learning_rate to 1e-10
increase timesteps to 60
None of them work.
So, my questions are:
Is it because my model is too simple? But I think the trend of my data is not so complicated to learn. At least something like y = ax + b will produce a smaller loss than y = b.
What may lead to these results?
Or how should I go about debugging this?
And now I doubt whether BasicRNNCell is fully implemented; are users expected to implement some of its functions themselves? I have no experience with tensorflow before this.
It seems your net is just not fit for that kind of data, or, from another point of view, your data is badly scaled. When adding the 4 lines below after split_data, I get some sort of learning behavior similar to the one in the a*x + b case:
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000
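A more conventional alternative would be to standardize to zero mean and unit variance (a sketch, assuming input_data and output_data are NumPy arrays):

input_data = (input_data - input_data.mean()) / input_data.std()
output_data = (output_data - output_data.mean()) / output_data.std()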

Pytorch how to get the gradient of loss function twice

Here is what I'm trying to implement:
We calculate loss based on F(X), as usual. But we also define "adversarial loss" which is a loss based on F(X + e). e is defined as dF(X)/dX multiplied by some constant. Both loss and adversarial loss are backpropagated for the total loss.
In tensorflow, this part (getting dF(X)/dX) can be coded like below:
grad, = tf.gradients( loss, X )
grad = tf.stop_gradient(grad)
e = constant * grad
Below is my pytorch code:
class DocReaderModel(object):
    def __init__(self, embedding=None, state_dict=None):
        self.train_loss = AverageMeter()
        self.embedding = embedding
        self.network = DNetwork(opt, embedding)
        self.optimizer = optim.SGD(parameters)

    def adversarial_loss(self, batch, loss, embedding, y):
        self.optimizer.zero_grad()
        loss.backward(retain_graph=True)
        grad = embedding.grad
        grad.detach_()
        perturb = F.normalize(grad, p=2) * 0.5
        self.optimizer.zero_grad()

        adv_embedding = embedding + perturb
        network_temp = DNetwork(self.opt, adv_embedding)  # This is how to get F(X)
        network_temp.training = False
        network_temp.cuda()
        start, end, _ = network_temp(batch)  # This is how to get F(X)
        del network_temp  # I even deleted this instance.
        return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

    def update(self, batch):
        self.network.train()
        start, end, pred = self.network(batch)
        loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
        loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
        loss_total = loss + loss_adv

        self.optimizer.zero_grad()
        loss_total.backward()
        self.optimizer.step()
I have a few questions:
1) I substituted tf.stop_gradient with grad.detach_(). Is this correct?
2) I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time." so I added retain_graph=True to the loss.backward() call. That specific error went away.
However, now I'm getting a memory error after a few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58). I suspect I'm unnecessarily retaining the graph.
Can someone let me know pytorch's best practice on this? Any hint / even short comment will be highly appreciated.
I think you are trying to implement a generative adversarial network (GAN), but from the code, I can't follow what you are trying to achieve, as there are a few missing pieces for a GAN to work. I can see there's a discriminator network module, DNetwork, but the generator network module is missing.
If I had to guess, when you say 'loss function twice', I assume you mean you have one loss function for the discriminator net and another for the generator net. If that's the case, let me share how I would implement a basic GAN model.
As an example, let's take a look at this Wasserstein GAN Jupyter notebook
I'll skip the less important bits and zoom into the important ones here:
First, import PyTorch libraries and set up
# Set up batch size, image size, and size of noise vector:
bs, sz, nz = 64, 64, 100 # nz is the size of the latent z vector for creating some random noise later
Build a discriminator module
class DCGAN_D(nn.Module):
    def __init__(self):
        ... truncated, the usual neural nets stuffs, layers, etc ...
    def forward(self, input):
        ... truncated, the usual neural nets stuffs, layers, etc ...
Build a generator module
class DCGAN_G(nn.Module):
    def __init__(self):
        ... truncated, the usual neural nets stuffs, layers, etc ...
    def forward(self, input):
        ... truncated, the usual neural nets stuffs, layers, etc ...
Put them all together
netG = DCGAN_G().cuda()
netD = DCGAN_D().cuda()
Optimizer needs to be told what variables to optimize. A module automatically keeps track of its variables.
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
One forward step and one backward step for Discriminator
Here, the network can calculate the gradient during the backward pass, depending on the input to this function. So, in my case, I have 3 types of losses: the generator loss, the discriminator real-image loss, and the discriminator fake-image loss. I can get the gradient of the loss function three times for 3 different net passes.
def step_D(input, init_grad):
    # input can be from generator's generated image data or input image from dataset
    err = netD(input)
    err.backward(init_grad)  # backward pass net to calculate gradient
    return err  # loss
Control trainable parameters [IMPORTANT]
Trainable parameters in the model are those that require gradients.
def make_trainable(net, val):
    for p in net.parameters():
        p.requires_grad = val  # note: this is later set to False for the netG update in the train loop below
In TensorFlow, this part can be coded like below:
grad = tf.gradients(loss, X)
grad = tf.stop_gradient(grad)
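For comparison, a rough PyTorch counterpart of that snippet (a sketch only; loss, X and constant are the same hypothetical quantities as in the TF lines above, with X a tensor that has requires_grad=True):

import torch

grad, = torch.autograd.grad(loss, X)
grad = grad.detach()   # plays the role of tf.stop_gradient
e = constant * grad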
So, I think this will answer your first question, "I substituted tf.stop_gradient with grad.detach_(). Is this correct?"
Train loop
You can see here how the 3 different loss functions are being called.
def train(niter, first=True):
    for epoch in range(niter):
        # Make iterable from PyTorch DataLoader
        data_iter = iter(dataloader)
        i = 0
        while i < n:
            ###########################
            # (1) Update D network
            ###########################
            make_trainable(netD, True)
            # train the discriminator d_iters times
            d_iters = 100
            j = 0
            while j < d_iters and i < n:
                j += 1
                i += 1
                # clamp parameters to a cube
                for p in netD.parameters():
                    p.data.clamp_(-0.01, 0.01)
                data = next(data_iter)
                ##### train with real #####
                real_cpu, _ = data
                real_cpu = real_cpu.cuda()
                real = Variable(data[0].cuda())
                netD.zero_grad()
                # Real image discriminator loss
                errD_real = step_D(real, one)
                ##### train with fake #####
                fake = netG(create_noise(real.size()[0]))
                input.data.resize_(real.size()).copy_(fake.data)
                # Fake image discriminator loss
                errD_fake = step_D(input, mone)
                # Discriminator loss
                errD = errD_real - errD_fake
                optimizerD.step()
            ###########################
            # (2) Update G network
            ###########################
            make_trainable(netD, False)
            netG.zero_grad()
            # Generator loss
            errG = step_D(netG(create_noise(bs)), one)
            optimizerG.step()
            print('[%d/%d][%d/%d] Loss_D: %f Loss_G: %f Loss_D_real: %f Loss_D_fake %f'
                  % (epoch, niter, i, n,
                     errD.data[0], errG.data[0], errD_real.data[0], errD_fake.data[0]))
"I was getting "RuntimeError: Trying to backward through the graph a second time..."
PyTorch has this behaviour; to reduce GPU memory usage, during the .backward() call, all the intermediary results (if you have like saved activations, etc.) are deleted when they are not needed anymore. Therefore, if you try to call .backward() again, the intermediary results don't exist and the backward pass cannot be performed (and you get the error you see).
It depends on what you are trying to do. You can call .backward(retain_graph=True) to make a backward pass that will not delete intermediary results, and so you will be able to call .backward() again. All but the last call to backward should have the retain_graph=True option.
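A minimal illustration of that rule, with toy tensors rather than the question's model:

import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
loss.backward(retain_graph=True)  # intermediary results are kept
loss.backward()                   # the last call is free to release the graph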
Can someone let me know pytorch's best practice on this
As you can see from the PyTorch code above and from the way things are being done in PyTorch which is trying to stay Pythonic, you can get a sense of PyTorch's best practice there.
If you want to work with higher-order derivatives (i.e. a derivative of a derivative) take a look at the create_graph option of backward.
For example:
loss = get_loss()
loss.backward(create_graph=True)
loss_grad_penalty = loss + loss.grad
loss_grad_penalty.backward()
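A sketch of the same idea with torch.autograd.grad, which is often how gradient penalties are written (toy tensors again, not the question's model):

import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()

# create_graph=True keeps building the graph through the gradient computation,
# so the gradient norm itself can be differentiated
grad, = torch.autograd.grad(loss, x, create_graph=True)
penalty = (grad.norm(2) - 1) ** 2   # e.g. a WGAN-GP style penalty on the gradient norm
(loss + penalty).backward()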