Different outputs with LSTM in PyTorch vs TensorFlow

I am trying to convert a TensorFlow (1.15) model to a PyTorch model. Since I was getting very different loss values, I tried comparing the output of the LSTM in the forward pass for the same input. The declaration and initialization of the LSTM are given below.
TensorFlow code
rnn_cell_video_fw = tf.contrib.rnn.LSTMCell(
    num_units=self.options['rnn_size'],
    state_is_tuple=True,
    initializer=tf.orthogonal_initializer()
)
rnn_cell_video_fw = tf.contrib.rnn.DropoutWrapper(
    rnn_cell_video_fw,
    input_keep_prob=1.0 - rnn_drop,
    output_keep_prob=1.0 - rnn_drop
)
sequence_length = tf.expand_dims(tf.shape(video_feat_fw)[1], axis=0)
initial_state = rnn_cell_video_fw.zero_state(batch_size=batch_size, dtype=tf.float32)
rnn_outputs_fw, _ = tf.nn.dynamic_rnn(
    cell=rnn_cell_video_fw,
    inputs=video_feat_fw,
    sequence_length=sequence_length,
    initial_state=initial_state,
    dtype=tf.float32
)
PyTorch code
self.rnn_video_fw = nn.LSTM(self.options['video_feat_dim'], self.options['rnn_size'], dropout = self.options['rnn_drop'])
rnn_outputs_fw, _ = self.rnn_video_fw(video_feat_fw)
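Not necessarily the cause of the mismatch, but two details worth checking when lining these up: nn.LSTM defaults to sequence-major input of shape (seq_len, batch, feature), whereas dynamic_rnn above is fed batch-major video_feat_fw, and nn.LSTM's dropout argument is only applied between stacked layers. A hedged sketch of a batch-major declaration:
# Hedged sketch (belongs in the model's __init__ like the line above): batch_first=True
# makes nn.LSTM consume batch-major input, matching how dynamic_rnn is fed here.
# Note: nn.LSTM's dropout only acts between stacked layers, so it is a no-op for a
# single-layer LSTM, unlike the DropoutWrapper above, which drops cell inputs/outputs.
self.rnn_video_fw = nn.LSTM(self.options['video_feat_dim'], self.options['rnn_size'],
                            batch_first=True, dropout=self.options['rnn_drop'])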
Initialization for LSTM in train.py
def init_weight(m):
    if type(m) in [nn.LSTM]:
        for param in m.parameters():
            nn.init.orthogonal_(m.weight_hh_l0)
            nn.init.orthogonal_(m.weight_ih_l0)
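For reference, a fuller way to mirror tf.orthogonal_initializer is to initialize every LSTM weight matrix by name instead of hard-coding layer 0; a minimal sketch (the zero bias initialization is an illustrative choice, not taken from the original code):
import torch.nn as nn

def init_weight(m):
    if isinstance(m, nn.LSTM):
        for name, param in m.named_parameters():
            if 'weight' in name:
                # orthogonal init for both input-hidden and hidden-hidden matrices
                nn.init.orthogonal_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)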
The output for TensorFlow: [screenshot]
The output for PyTorch: [screenshot]
The same is true for pretty much every data item, and my PyTorch model isn't converging. Is my suspicion correct that the difference in LSTM output is the reason? If so, where am I going wrong?
Link to the paper
Link to TF code
Let me know if anything else is required.

Related

Why does my model learn with Ragged Tensors but not Dense Tensors?

I have strings of letters that follow a "grammar." I also have boolean labels on my training set indicating whether each string follows "the grammar" or not. Basically, my model is trying to learn to determine whether a string of letters follows the rules. It's a fairly simple problem (I got it out of a textbook).
I am generating my dataset like this:
def generate_dataset(size):
    good_strings = [string_to_ids(generate_string(embedded_reber_grammar))
                    for _ in range(size // 2)]
    bad_strings = [string_to_ids(generate_corrupted_string(embedded_reber_grammar))
                   for _ in range(size - size // 2)]
    all_strings = good_strings + bad_strings
    X = tf.ragged.constant(all_strings, ragged_rank=1)
    # X = X.to_tensor(default_value=0)
    y = np.array([[1.] for _ in range(len(good_strings))] +
                 [[0.] for _ in range(len(bad_strings))])
    return X, y
Notice the line X = X.to_tensor(default_value=0). If this line is commented out, my model learns just fine. However, if it is not commented out, it fails to learn and the validation set performs the same as chance (50-50).
Here is my actual model:
np.random.seed(42)
tf.random.set_seed(42)

embedding_size = 5
model = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=[None], dtype=tf.int32, ragged=True),
    keras.layers.Embedding(input_dim=len(POSSIBLE_CHARS) + 1, output_dim=embedding_size),
    keras.layers.GRU(30),
    keras.layers.Dense(1, activation="sigmoid")
])
optimizer = keras.optimizers.SGD(lr=0.02, momentum=0.95, nesterov=True)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_valid, y_valid))
I am using 0 as the default value for the dense tensors. The string_to_ids function doesn't use 0 for any of the values but instead starts at 1. Also, when I switch to using a dense tensor, I change ragged=True to ragged=False. I have no idea why using a dense tensor causes the model to fail, as I've used dense tensors before in similar exercises.
For additional details, see the solution from the book (exercise 8) or my own colab notebook.
It turns out the answer was that the shape of the dense tensor differed between the training set and the validation set, because the longest sequence differed in length between the two sets (and likewise for the test set).
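As a hedged sketch of one fix under that diagnosis: pad every split to one shared maximum length so that to_tensor/fit see the same dense shape everywhere. Note that max_len here is hypothetical and would have to be computed over the training, validation and test strings together:
import tensorflow as tf

# Hypothetical: all_strings drawn from every split, not just one call to generate_dataset.
max_len = max(len(s) for s in all_strings)
X = tf.keras.preprocessing.sequence.pad_sequences(
    all_strings, maxlen=max_len, padding='post', value=0)  # 0 is never used by string_to_ids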

How to use tensorflow.distributions in a custom loss function for a keras model

For a deep learning model I defined with TF 2.0 Keras, I need to write a custom loss function.
As this will depend on things like entropy and the normal log_prob, it would really make my life less miserable if I could use tf.distributions.Normal and use two model outputs as mu and sigma respectively.
However, as soon as I put this into my loss function, I get the Keras error that no gradient is defined for this function.
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I tried encapsulating the call in a tf.contrib.eager.Variable, as I read somewhere. It did not help.
What is the trick to using them? I don't see a fundamental architectural reason why I should not be able to use them in this mixed form.
# this is just an example which does not really give a meaningful result.
import tensorflow as tf
import tensorflow.keras as K
import numpy as np

def custom_loss_fkt(extra_output):
    def loss(y_true, y_pred):
        dist = tf.distributions.Normal(loc=y_pred, scale=extra_output)
        d = dist.entropy()
        return K.backend.mean(d)
    return loss

input_node = K.layers.Input(shape=(1,))
dense = K.layers.Dense(8, activation='relu')(input_node)
# dense = K.layers.Dense(4, activation='relu')(dense)
out1 = K.layers.Dense(4, activation='linear')(dense)
out2 = K.layers.Dense(4, activation='linear')(dense)
model = K.Model(inputs=input_node, outputs=[out1, out2])
model.compile(optimizer='adam', loss=[custom_loss_fkt(out2), custom_loss_fkt(out1)])
model.summary()

x = np.zeros((1, 1))
y1 = np.array([[0., 0.1, 0.2, 0.3]])
y2 = np.array([[0.1, 0.1, 0.1, 0.1]])
model.fit(x, [y1, y2], epochs=1000, verbose=0)
print(model.predict(x))
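No accepted fix is shown here, but for what it's worth, here is a hedged sketch of the same idea written against TF 2.x: tensorflow_probability's Normal replaces tf.distributions.Normal, and both mu and sigma are carried in a single model output so the loss never has to close over a second Keras tensor. The function name and the concatenation trick are illustrative assumptions, not the asker's code:
import tensorflow as tf
import tensorflow_probability as tfp

def entropy_loss(y_true, y_pred):
    # Assumes y_pred holds [mu, log_sigma] concatenated along the last axis.
    mu, log_sigma = tf.split(y_pred, num_or_size_splits=2, axis=-1)
    dist = tfp.distributions.Normal(loc=mu, scale=tf.exp(log_sigma))
    return tf.reduce_mean(dist.entropy())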

ValueError: No gradients provided for any variable when tensorflow operations added on keras output

I have a pre-trained Keras Sequential model called agent, and I'm trying to fine-tune it with a loss function.
json_file = open('model/prior_model_RMSprop.json', 'r')
json_model = json_file.read()
json_file.close()
agent = model_from_json(json_model)
prior = model_from_json(json_model)
# load weights into model
agent.load_weights('model/model_RMSprop.h5')
prior.load_weights('model/model_RMSprop.h5')
agent_output = agent.output
prior_output = prior.output
loss = tf.reduce_mean(tf.square(agent_output - prior_output))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
So far, everything works fine. However, when I add some basic tensorflow operations, the error happens
agent_logits = tf.cast(tf.argmax(agent_output, axis = 2), dtype = tf.float32)
prior_logits = tf.cast(tf.argmax(prior_output, axis = 2), dtype = tf.float32)
loss = tf.reduce_mean(tf.square(agent_logits - prior_logits))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
ValueError: No gradients provided for any variable
So do the TensorFlow operations break the connection between the model and the loss function? I've been stuck here for almost 2 weeks, so please help. I'm also not very clear about how to update a Keras model's trainable weights with the loss function I defined. Any hints or related links will be appreciated!
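Not a definitive fix, but since argmax has no defined gradient, a common workaround is to compare the soft output distributions instead of the hard argmax indices; a minimal sketch under that assumption, reusing agent_output and prior_output from above:
# Hedged sketch: softmax keeps the graph differentiable where argmax does not.
agent_probs = tf.nn.softmax(agent_output, axis=2)
prior_probs = tf.nn.softmax(prior_output, axis=2)
loss = tf.reduce_mean(tf.square(agent_probs - prior_probs))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)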

tensorflow CNN with complex features and labels?

I recently found a paper where they used a CNN with complex 2D feature maps as input. However, their network also outputs a complex vector. They used Keras with the TensorFlow backend.
Here is the link: https://arxiv.org/pdf/1802.04479.pdf
I asked myself whether it is possible to build complex-valued deep neural networks like CNNs with TensorFlow. As far as I know it is not possible. Did I miss something?
There are other related questions which address the same problem with no answer: Complex convolution in tensorflow
When building a trivial model with real-valued inputs and outputs, everything works correctly:
import tensorflow as tf
from numpy import random, empty

n = 10
feature_vec_real = random.rand(1, n)
X_real = tf.placeholder(tf.float64, feature_vec_real.shape)

def model(x):
    out = tf.layers.dense(
        inputs=x,
        units=2
    )
    return out

model_output = model(X_real)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
output = sess.run(model_output, feed_dict={X_real: feature_vec_real})
but when using complex inputs:
import tensorflow as tf
from numpy import random, empty

n = 10
feature_vec_complex = empty(shape=(1, n), dtype=complex)
feature_vec_complex.real = random.rand(1, n)
feature_vec_complex.imag = random.rand(1, n)
X_complex = tf.placeholder(tf.complex128, feature_vec_complex.shape)

def complex_model(x):
    out = tf.layers.dense(
        inputs=x,
        units=2
    )
    return out

model_output = complex_model(X_complex)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
output = sess.run(model_output, feed_dict={X_complex: feature_vec_complex})
I get the following error:
ValueError: An initializer for variable dense_7/kernel of <dtype: 'complex128'> is required
So what is the correct way to initialize the weights of the dense kernel when having complex inputs?
I know it is possible to handle complex numbers as two different layers in the network, but this is not what I want.
Thanks for your help!
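No answer is shown here, but one untested sketch, assuming tf.layers.dense forwards kernel_initializer to the variable it creates: supply a custom initializer that produces complex values, since the default glorot initializer cannot handle tf.complex128. The function name and stddev below are illustrative:
def complex_random_init(shape, dtype=tf.complex128, partition_info=None):
    # Build the complex kernel from two real-valued random tensors.
    real = tf.random_normal(shape, stddev=0.1, dtype=tf.float64)
    imag = tf.random_normal(shape, stddev=0.1, dtype=tf.float64)
    return tf.complex(real, imag)

out = tf.layers.dense(inputs=X_complex, units=2, kernel_initializer=complex_random_init)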

adding one hot encoding throws error in previously working code in Tensorflow

with tf.variable_scope("rnn_seq2seq"):
    w = tf.get_variable("proj_w", [num_units, seq_width])
    w_t = tf.transpose(w)
    b = tf.get_variable("proj_b", [seq_width])
    output_projection = (w, b)
    output, state = rnn_seq2seq(enc_inputs, dec_inputs, cell,
                                output_projection=output_projection,
                                feed_previous=False)
    weights = [tf.ones([batch_size * dec_steps])]
    loss = []
    for i in xrange(dec_steps - 1):
        logits = tf.nn.xw_plus_b(output[i], output_projection[0], output_projection[1])
If I introduce one-hot encoding on the logits here, the program gives an error later, although both tensors have the same dimensions. If I comment out this line, the program does not give any error.
        prev = logits
        logits = tf.to_float(tf.equal(prev, tf.reduce_max(prev, reduction_indices=[1], keep_dims=True)))
        print prev
        print logits
Tensor("rnn_seq2seq/xw_plus_b:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Tensor("rnn_seq2seq/ToFloat:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Rest of code:
        crossent = tf.nn.softmax_cross_entropy_with_logits(
            logits, dec_inputs[i + 1], name="SequenceLoss/CrossEntropy{0}".format(i))
        loss.append(crossent)

cost = tf.reduce_sum(tf.add_n(loss))
final_state = state[-1]
tvars = tf.trainable_variables()
grads, norm = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
lr = tf.Variable(0.0, name="learningRate")
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
---> 23 grads,norm = tf.clip_by_global_norm(tf.gradients(cost,tvars),5)
ValueError: List argument 'values' to 'Pack' Op with length 0 shorter than minimum length 1.
Neural networks can only be trained if all the operations they perform are differentiable. The "one-hot" step you apply is not differentiable, and hence such a neural network cannot be trained using any gradient-descent-based optimizer (i.e., any optimizer that TensorFlow implements).
The general approach is to use softmax (which is differentiable) during training to approximate one-hot encoding (and your model already applies softmax after computing the logits, so commenting out the "one-hot" step is actually all you need to do).
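To make that concrete, a minimal sketch of the differentiable stand-in (the temperature value is illustrative):
# A low-temperature softmax approaches the hard one-hot of argmax
# while keeping gradients defined everywhere.
temperature = 0.1
soft_one_hot = tf.nn.softmax(logits / temperature)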