I've implemented an attentive convolution layer in keras, as described in this paper.
You can see the code for it at this gist
I'm new to implementing custom layers and it is still very slow. I'm using a lot of tf.map_fn and I think this is the reason why it's so slow, but I don't know a different way to do this.
It would be nice if someone has some tips how to improve the layer or general tips on implementing custom layers like how to avoid backend (tensorflow) functions.
I'm using keras 2.1.3 and tensorflow 1.5 as backend.
Thanks
I don't see why you use tf.map_fn, you could avoid it everywhere...
Here are some hints (which may or may not make the code faster).
Casting
Do you really need to cast the values to float? If (at least) x[0] is an embedding, it's already a float, right? (Not sure about the nature of "context")
Lines 37 and 38:
text = x[0]
context = x[1]
Why map functions that are already supported in keras?
For instance, why do this (L42):
weighted_attentive_context = tf.map_fn(self._compute_attentive_context, (text, context), dtype=K.floatx())
When you can do this?
weighted_attentive_context = self._compute_attentive_context(text,context)
With:
def _comput_attentive_context(self,text,context):
Suggestion for _compute_attentive_context:
def _compute_attentive_context(self, text, context):
#computes the context-score for every vector like equation 2
temp = tf.matmul(text, self.We)
scores = tf.matmul(temp, K.transpose(context))
#why not?
scores_softmax = K.softmax(scores)
#computes the context featur_map like equation 4
res = tf.matmul(scores_softmax, context)
#why not?
res = self._weight_for_output(res)
return res
And why not use a K.conv1D instead of all these complicated repeat, concatenation, etc?
def _conv(self, x):
return K.conv1D(x, self.W1, padding='same')
#if you have special reasons for what you're doing, please share them in the comments,
#please also share the exact shapes of the inputs and desired outputs
#here, you should make self.W1 with shape (filterLength, em_dim, desired_output_dim)
Suggestion for call:
def call(self, x, mask=None):
#x is a list of two tensors
text = x[0]
context = x[1]
#applies bilinear energy funtion (text * We * context)
#and weights the computed feature map like in equation 6 (W2 * ci)
weighted_attentive_context = self._compute_attentive_context(text, context)
#does the actual convolution, this is still kind of hacky
conv = K.conv1D(text,self.W1,padding='same')
added = conv + weighted_attentive_context
batch = K.bias_add(added, self.bias)
return batch
Batch matrix multiplication
For those multiplications, you can use K.dot(), following this:
If batch x weights: K.dot(x, self.W)
If weights x batch: K.permute_dimensions(K.dot(self.W,x),(1,0,2))
Considering you have these shapes:
If batch x weights -> x: (batch, words, emb) | W: (emb, any)
If weights x batch -> W: (any, words) | x: (batch, words, emb)
The results will be:
If batch x weights: (words,any) <- this seems the logical choice
If weigths x batch: (any, emb)
Related
In my project, I have a number of cases where I have a Dataset instance and I need to get predictions from some model on every item in the dataset.
The model.predict() API is optimized perfectly for this, as shown in the documentation. However, there seems to be one major catch. I also happen to need the labels to compare with the predicted values, i.e. the dataset contains x,y pairs, and I'd like to end up with (y_predicted, y) pairs after the prediction is complete. This does not seem to be possible with the predict() API though, and I can't think of a clean way to 'split' the dataset so that the x's are fed into the model and the y's are retained to be joined back up with the predicted y's.
EDIT: I know it's quite simple to do by iterating over the dataset manually and calling the model directly, e.g.
for x, y in dataset:
y_pred = model(x)
result.append((y, y_pred))
However, this seems like it will be a fair bit slower than using the inbuilt predict() as Tensorflow won't be able to multi-thread/optimize the input pipeline.
Does anyone have a good way to accomplish this?
Given the concerns you mentioned, it may be best to overwrite predict to suit your needs. You don't actually need to overwrite that function though, instead only predict_step which is called by that function. Just use this class instead of Model:
class MyModel(tf.keras.Model):
def predict_step(self, data):
x, y = data
return self(x, training=False), y
If your model is currently Sequential, inherit from that instead. Basically the only change I made from the default implementation is to add , y to the model call result.
Note that this also makes some assumptions, such that your dataset consists of (input, label) batch pairs. You may need to adapt it slightly to your needs. Here is a minimal example:
import tensorflow as tf
import numpy as np
(imgs, lbls), (te_imgs, te_lbls) = tf.keras.datasets.mnist.load_data()
imgs = imgs.astype(np.float32).reshape((-1, 784)) / 255.
te_imgs = te_imgs.astype(np.float32).reshape((-1, 784)) / 255.
lbls = lbls.astype(np.int32)
te_lbls = te_lbls.astype(np.int32)
tr_data = tf.data.Dataset.from_tensor_slices((imgs, lbls)).shuffle(60000).batch(128)
te_data = tf.data.Dataset.from_tensor_slices((te_imgs, te_lbls)).batch(128)
class MyModel(tf.keras.Model):
def predict_step(self, data):
x, y = data
return self(x, training=False), y
inp = tf.keras.Input((784,))
logits = tf.keras.layers.Dense(10)(inp)
model = MyModel(inp, logits)
opt = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer=opt)
something = model.predict(te_data)
print(something[0].shape, something[1].shape)
This shows ((10000, 10), (10000,)) -- predict now returns a tuple of outputs, labels (this can be confirmed by inspecting the returned labels and comparing to the images in the test set).
I want to create a network with two parallel layers (same input is given to two different layers and output of them is combined with some mathematical operations). Having said that, I am not sure the back-propagation will be done by Keras automatically. As a simple example of custom RNN cell,
class Example(keras.layers.Layer):
def __init__(self, units, **kwargs):
super(Example, self).__init__(**kwargs)
self.units = units
self.state_size = units
self.la = keras.layers.Dense(self.units)
self.lb = keras.layers.Dense(self.units)
def call(self, inputs, states):
prev_output = states[0]
# parallel layers
a = tf.sigmoid(self.la(inputs))
b = tf.sigmoid(self.lb(inputs))
# combined using mathematical operation
output = (-1 * prev_output * a) + (prev_output * b)
return output, [output]
Now, the loss gradient to `la` and `lb` layers are different (gradient of loss wrt `a`, should be `-output` but wrt `b` should be `output`), will this be taken care by Keras automatically or should we create custom gradient functions?
Any insights and suggestions are much appreciated :)
Check the answer by Daniel Möller
Back-propagation will be taken care by Keras as long as all calculation is linked by tensor object, i.e don't cast tensor to another type like array, so don't worry about it.
For example of gradient tape, you can check the gradient at each layer by:
gradients = grad_tape.gradient(total_loss, model.trainable_variables)
gradient_of_last_layer = tf.reduce_max(gradients[-1])
I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
data_qubits = cirq.GridQubit.rect(4, 1)
circuit = cirq.Circuit()
...
return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# compile model
model.compile(
loss = loss_mse,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
metrics=[])
in loss_mse (Mean Square Error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells if this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be performed by that row, followed by a np.sum on all rows. I tried to implement this:
def get_label(measurement):
if measurement == [0,0,0,0]: return 0
...
elif measurement == [1,1,1,1]: return 0
else: return -1
def py_call(y_true, y_pred):
# cast tensor to numpy
y_pred_np = np.asarray(y_pred)
loss = np.zeros((len(y_pred))) # could be a single variable with += within the loop
# evalaute all 32 samples
for pred in range(len(y_pred_np)):
# map, binarize and lookup
y_labelled = get_label([0 if y<0 else 1 for y in y_pred_np[pred]])
# regular loss comparison
loss[pred] = (y_labelled - y_true[pred])**2
# reduce
loss = np.sum(loss)/len(y_true)
return loss
#tf.function
def loss_mse(y_true, y_pred):
external_list = []
loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
return loss
However, the system appears to still expect a (32,4) tensor. I would have thought I could simply provide a single loss values (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. This function will allow you to apply this reduction operation accross a given tensor axis allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
#tf.function
def my_loss(y_true, y_pred):
# y_true is shape (32, 4)
# y_pred is shape (32, 4)
# Scale from [-1, 1] to [0, 1]
y_true += 1
y_true /= 2
y_pred += 1
y_pred /= 2
# These are now both (32,) with the reduction of taking the mean applied along
# the second axis.
reduced_true = tf.reduce_mean(y_true, axis=1)
reduced_pred = tf.reduce_mean(y_pred, axis=1)
# Now a scalar loss.
loss = tf.reduce_mean((reduce_true - reduced_pred) ** 2)
return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_op)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatable as a readout operation in the PQC layer. Lastly if would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.
How can I get the loss function used by tf.keras.Model.fit(x, y) to compare two outputs within the graph instead of one output with externally supplied target data, y?
The manual says you can use tensors for target value which sounds like what I want but that you then also need the inputs to be tensors. But my inputs are numpy arrays and I don't think I should have to change that.
1 - Easy, best - maybe not good for memory
Why not just get the expected items for the loss already?
new_y_train = non_trainable_ops_model.predict(original_y_train)
nn_model.fit(x_train, new_y_train)
This sounds definitely the best way if your memory can handle this. Simpler model, faster training.
You can even save/load the new data:
np.save(name, new_y_train)
new_y_train = np.load(name)
2 - Make the model output the loss and use a dummy loss for compiling
Losses:
def dummy_loss(true, pred):
return pred
def true_loss(x):
true, pred = x
return loss_function(true, pred) #you can probably from keras.losses import loss_function
Model:
#given
nn_model = create_nn_model()
non_trainable_ops_model = create_nto_model()
nn_input = Input(nn_input_shape)
nto_input = Input(nto_input_shape)
nn_outputs = nn_model(nn_input)
nto_outputs = non_trainable_ops_model(nto_input)
loss = Lambda(true_loss)([nto_outputs, nn_outputs])
training_model = Model([nn_input, nto_input], loss)
training_model.compile(loss = dummy_loss, ...)
training_model.fit([nn_x_train, nto_x_train], np.zeros((len(nn_x_train),)))
3 - Use model.add_loss instead of compiling a loss
Following the same as the previous answer, you can:
training_model = Model([nn_input, nto_input], nn_outputs)
loss = true_loss([nto_outputs, nn_outputs])
training_model.add_loss(loss)
training_model.compile(loss=None, ...)
training_model.fit([nn_x_train, nto_x_train], None)
4 - Enable eager execution and make custom training loops
https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Here is what I'm trying to implement:
We calculate loss based on F(X), as usual. But we also define "adversarial loss" which is a loss based on F(X + e). e is defined as dF(X)/dX multiplied by some constant. Both loss and adversarial loss are backpropagated for the total loss.
In tensorflow, this part (getting dF(X)/dX) can be coded like below:
grad, = tf.gradients( loss, X )
grad = tf.stop_gradient(grad)
e = constant * grad
Below is my pytorch code:
class DocReaderModel(object):
def __init__(self, embedding=None, state_dict=None):
self.train_loss = AverageMeter()
self.embedding = embedding
self.network = DNetwork(opt, embedding)
self.optimizer = optim.SGD(parameters)
def adversarial_loss(self, batch, loss, embedding, y):
self.optimizer.zero_grad()
loss.backward(retain_graph=True)
grad = embedding.grad
grad.detach_()
perturb = F.normalize(grad, p=2)* 0.5
self.optimizer.zero_grad()
adv_embedding = embedding + perturb
network_temp = DNetwork(self.opt, adv_embedding) # This is how to get F(X)
network_temp.training = False
network_temp.cuda()
start, end, _ = network_temp(batch) # This is how to get F(X)
del network_temp # I even deleted this instance.
return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
def update(self, batch):
self.network.train()
start, end, pred = self.network(batch)
loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
loss_total = loss + loss_adv
self.optimizer.zero_grad()
loss_total.backward()
self.optimizer.step()
I have few questions:
1) I substituted tf.stop_gradient with grad.detach_(). Is this correct?
2) I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time." so I added retain_graph=True at the loss.backward. That specific error went away.
However now I'm getting a memory error after few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58
). I suspect I'm unnecessarily retaining graph.
Can someone let me know pytorch's best practice on this? Any hint / even short comment will be highly appreciated.
I think you are trying to implement generative adversarial network (GAN), but from the code, I don't understand and can't follow to what you are trying to achieve as there are a few missing pieces for a GAN to works. I can see there's a discriminator network module, DNetwork but missing the generator network module.
If to guess, when you say 'loss function twice', I assumed you mean you have one loss function for the discriminator net and another for the generator net. If that's the case, let me share how I would implement a basic GAN model.
As an example, let's take a look at this Wasserstein GAN Jupyter notebook
I'll skip the less important bits and zoom into the important ones here:
First, import PyTorch libraries and set up
# Set up batch size, image size, and size of noise vector:
bs, sz, nz = 64, 64, 100 # nz is the size of the latent z vector for creating some random noise later
Build a discriminator module
class DCGAN_D(nn.Module):
def __init__(self):
... truncated, the usual neural nets stuffs, layers, etc ...
def forward(self, input):
... truncated, the usual neural nets stuffs, layers, etc ...
Build a generator module
class DCGAN_G(nn.Module):
def __init__(self):
... truncated, the usual neural nets stuffs, layers, etc ...
def forward(self, input):
... truncated, the usual neural nets stuffs, layers, etc ...
Put them all together
netG = DCGAN_G().cuda()
netD = DCGAN_D().cuda()
Optimizer needs to be told what variables to optimize. A module automatically keeps track of its variables.
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
One forward step and one backward step for Discriminator
Here, the network can calculate gradient during the backward pass, depends on the input to this function. So, in my case, I have 3 type of losses; generator loss, dicriminator real image loss, dicriminator fake image loss. I can get gradient of loss function three times for 3 different net passes.
def step_D(input, init_grad):
# input can be from generator's generated image data or input image from dataset
err = netD(input)
err.backward(init_grad) # backward pass net to calculate gradient
return err # loss
Control trainable parameters [IMPORTANT]
Trainable parameters in the model are those that require gradients.
def make_trainable(net, val):
for p in net.parameters():
p.requires_grad = val # note, i.e, this is later set to False below in netG update in the train loop.
In TensorFlow, this part can be coded like below:
grad = tf.gradients(loss, X)
grad = tf.stop_gradient(grad)
So, I think this will answer your first question, "I substituted tf.stop_gradient with grad.detach_(). Is this correct?"
Train loop
You can see here how's the 3 different loss functions are being called here.
def train(niter, first=True):
for epoch in range(niter):
# Make iterable from PyTorch DataLoader
data_iter = iter(dataloader)
i = 0
while i < n:
###########################
# (1) Update D network
###########################
make_trainable(netD, True)
# train the discriminator d_iters times
d_iters = 100
j = 0
while j < d_iters and i < n:
j += 1
i += 1
# clamp parameters to a cube
for p in netD.parameters():
p.data.clamp_(-0.01, 0.01)
data = next(data_iter)
##### train with real #####
real_cpu, _ = data
real_cpu = real_cpu.cuda()
real = Variable( data[0].cuda() )
netD.zero_grad()
# Real image discriminator loss
errD_real = step_D(real, one)
##### train with fake #####
fake = netG(create_noise(real.size()[0]))
input.data.resize_(real.size()).copy_(fake.data)
# Fake image discriminator loss
errD_fake = step_D(input, mone)
# Discriminator loss
errD = errD_real - errD_fake
optimizerD.step()
###########################
# (2) Update G network
###########################
make_trainable(netD, False)
netG.zero_grad()
# Generator loss
errG = step_D(netG(create_noise(bs)), one)
optimizerG.step()
print('[%d/%d][%d/%d] Loss_D: %f Loss_G: %f Loss_D_real: %f Loss_D_fake %f'
% (epoch, niter, i, n,
errD.data[0], errG.data[0], errD_real.data[0], errD_fake.data[0]))
"I was getting "RuntimeError: Trying to backward through the graph a second time..."
PyTorch has this behaviour; to reduce GPU memory usage, during the .backward() call, all the intermediary results (if you have like saved activations, etc.) are deleted when they are not needed anymore. Therefore, if you try to call .backward() again, the intermediary results don't exist and the backward pass cannot be performed (and you get the error you see).
It depends on what you are trying to do. You can call .backward(retain_graph=True) to make a backward pass that will not delete intermediary results, and so you will be able to call .backward() again. All but the last call to backward should have the retain_graph=True option.
Can someone let me know pytorch's best practice on this
As you can see from the PyTorch code above and from the way things are being done in PyTorch which is trying to stay Pythonic, you can get a sense of PyTorch's best practice there.
If you want to work with higher-order derivatives (i.e. a derivative of a derivative) take a look at the create_graph option of backward.
For example:
loss = get_loss()
loss.backward(create_graph=True)
loss_grad_penalty = loss + loss.grad
loss_grad_penalty.backward()