Tensorflow infinity mask breaks gradient

I'm trying to do softmax over selected indices, using an infinity mask to silence the unwanted ones. However, the gradient of those unwanted entries becomes nan as opposed to 0.
The reason I didn't use a boolean mask is that the mask indices differ across my batch, so the result can't be packed into a nice matrix form. If there's a workaround here, I'll be more than happy to adopt it.
The code I used to test the infinity mask is:
import numpy as np
import tensorflow as tf
a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])
b = tf.multiply(a, inf_mask)
sf = tf.nn.softmax(b)
loss = (sf[2] - 0)
grad = tf.gradients(loss, a)
sess = tf.Session()
a_np = np.ones([5])
np_mask = np.ones([5]) * 4
np_mask[1] = -np.inf
print(sess.run([sf, grad], feed_dict={
    a: a_np,
    inf_mask: np_mask
}))
sess.close()
The output is
[array([ 0.25, 0. , 0.25, 0.25, 0.25], dtype=float32), [array([-0.25, nan, 0.75, -0.25, -0.25], dtype=float32)]]
The mask is working but the gradient has a nan, which should have been 0 I think.
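A possible explanation, sketched in NumPy rather than TensorFlow: the gradient flows back through b = a * inf_mask, so the chain rule multiplies the softmax derivative at the masked position (which is 0) by the mask value -inf, and 0 * -inf is nan. The helper names softmax and grad_s2_wrt_logits, and the additive mask with the constant -1e9, are assumptions made for illustration and are not part of the code above.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D array.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_s2_wrt_logits(s):
    # Analytic derivative: d s[2] / d b[j] = s[2] * (1[j == 2] - s[j])
    return s[2] * ((np.arange(len(s)) == 2) - s)

a = np.ones(5)

# Multiplicative infinity mask, as in the question: b = a * mask.
mask = np.ones(5) * 4
mask[1] = -np.inf
s = softmax(a * mask)
with np.errstate(invalid="ignore"):
    # Chain rule: d s[2] / d a = (d s[2] / d b) * mask, and 0 * -inf = nan.
    grad_mul = grad_s2_wrt_logits(s) * mask

# Additive mask (assumption: -1e9 is "negative enough"): b = 4 * a + additive_mask.
additive_mask = np.zeros(5)
additive_mask[1] = -1e9
s_add = softmax(4 * a + additive_mask)
# Chain rule: d b / d a = 4 everywhere, so the masked entry gets 0 * 4 = 0.
grad_add = grad_s2_wrt_logits(s_add) * 4

print(grad_mul)
print(grad_add)
```

In this sketch the masked probability is still 0 with the additive mask, but the local derivative of b with respect to a stays finite, so the masked gradient entry is 0 instead of nan.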

Different cross entropy results from NumPy and PyTorch

My prediction is y_hat = [ 0.57,0.05,0.14,0.10,0.14] and target is
target =[ 1, 0, 0, 0, 0 ].
I need to calculate the cross-entropy loss with both NumPy and the PyTorch loss function.
Using NumPy my formula is -np.sum(target*np.log(y_hat)), and I got 0.5621189181535413
However, using PyTorch:
import torch
import torch.nn as nn
loss = nn.CrossEntropyLoss()
output = torch.FloatTensor([0.57,0.05,0.14,0.10,0.14])
label = torch.FloatTensor([1,0,0,0,0])
loss_value = loss(output, label)
print(loss_value)
Gives tensor(1.2586), which is different.
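nn.CrossEntropyLoss expects unnormalized logits and applies log-softmax to its input internally, whereas your NumPy formula treats y_hat as probabilities. To reproduce the PyTorch result, apply the softmax function to your y_hat vector before computing the cross-entropy loss. For example, you can use scipy.special.softmax().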
>>> from scipy.special import softmax
>>> import numpy as np
>>> y_hat = [0.57, 0.05, 0.14, 0.10, 0.14]
>>> target =[1, 0, 0, 0, 0]
>>> y_hat = softmax(y_hat)
>>> -np.sum(target * np.log(y_hat))
1.2586146726011722
Which agrees with the result from Pytorch.
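If SciPy is not available, the same computation can be sketched in plain NumPy with a log-softmax. The function name cross_entropy_from_logits is an illustrative assumption, not a NumPy or PyTorch API.

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    # Treat `logits` as unnormalized scores, as nn.CrossEntropyLoss does.
    logits = np.asarray(logits, dtype=float)
    target = np.asarray(target, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(target * log_probs)

loss_value = cross_entropy_from_logits([0.57, 0.05, 0.14, 0.10, 0.14],
                                       [1, 0, 0, 0, 0])
print(loss_value)
```

Working in log space avoids taking the log of a probability that has underflowed to 0.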

Tensorflow Embedding using Continuous and Categorical Variable

Based on this post, I tried to create another model in which I add both categorical and continuous variables.
Please find the code below:
from __future__ import print_function
import pandas as pd
import tensorflow as tf
import numpy as np
from sklearn.preprocessing import LabelEncoder
if __name__ == '__main__':
    # 1 categorical input feature and a binary output
    df = pd.DataFrame({'cat2': np.array(['o', 'm', 'm', 'c', 'c', 'c', 'o', 'm', 'm', 'm']),
                       'num1': np.random.rand(10),
                       'label': np.array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1])})
    encoder = LabelEncoder()
    encoder.fit(df.cat2.values)
    X1 = encoder.transform(df.cat2.values).reshape(-1,1)
    X2 = np.array(df.num1.values).reshape(-1,1)
    # X = np.concatenate((X1,X2), axis=1)
    Y = np.zeros((len(df), 2))
    Y[np.arange(len(df)), df.label.values] = 1
    # Neural net parameters
    training_epochs = 5
    learning_rate = 1e-3
    cardinality = len(np.unique(X))
    embedding_size = 2
    input_X_size = 1
    n_labels = len(np.unique(Y))
    n_hidden = 10
    # Placeholders for input, output
    cat2 = tf.placeholder(tf.int32, [None], name='cat2')
    x = tf.placeholder(tf.float32, [None, 1], name="input_x")
    y = tf.placeholder(tf.float32, [None, 2], name="input_y")
    embed_matrix = tf.Variable(
        tf.random_uniform([cardinality, embedding_size], -1.0, 1.0),
        name="embed_matrix"
    )
    embed = tf.nn.embedding_lookup(embed_matrix, cat2)
    inputs_with_embed = tf.concat([x, embedding_aggregated], axis=2, name="inputs_with_embed")
    # Neural network weights
    h = tf.get_variable(name='h2', shape=[inputs_with_embed, n_hidden],
                        initializer=tf.contrib.layers.xavier_initializer())
    W_out = tf.get_variable(name='out_w', shape=[n_hidden, n_labels],
                            initializer=tf.contrib.layers.xavier_initializer())
    # Neural network operations
    #embedded_chars = tf.nn.embedding_lookup(embeddings, x)
    layer_1 = tf.matmul(inputs_with_embed,h)
    layer_1 = tf.nn.relu(layer_1)
    out_layer = tf.matmul(layer_1, W_out)
    # Define loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    # Initializing the variables
    init = tf.global_variables_initializer()
    # Launch the graph
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(training_epochs):
            avg_cost = 0.
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost],
                            feed_dict={x: X2,cat2:X1, y: Y})
        print("Optimization Finished!")
But I'm getting the following error. It seems I'm not concatenating the continuous variable and the embedding properly, but I don't understand how to fix it.
Could someone please guide me?
ValueError: Shape must be at least rank 3 but is rank 2 for 'inputs_with_embed_2' (op: 'ConcatV2') with input shapes: [?,1], [?,2], [] and with computed input tensors: input[2] = <2>.
Thanks!
If by embedding_aggregated you mean embed (probably a typo), the error is that there is no axis=2 in your case; it should be axis=1:
inputs_with_embed = tf.concat([x, embed], axis=1, name="inputs_with_embed")
embed has shape [None, embedding_dimension] and x has shape [None, 1].
They are both 2D tensors, so only axis=0 and axis=1 are available (axes are indexed from 0, not 1). Therefore, to get inputs_with_embed with shape [None, embedding_dimension+1], you need to concatenate along axis=1.
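The shape rule can be checked quickly in NumPy, whose concatenate uses the same axis convention. This is only a sketch: the arrays x and embed below are random stand-ins with a batch of 10 rows, not the tensors from the model.

```python
import numpy as np

x = np.random.rand(10, 1)       # stands in for the continuous input, shape [None, 1]
embed = np.random.rand(10, 2)   # stands in for the embedding lookup, shape [None, 2]

# Concatenating along axis=1 joins the feature columns: (10, 1) + (10, 2) -> (10, 3)
inputs_with_embed = np.concatenate([x, embed], axis=1)
print(inputs_with_embed.shape)

# A 2D array only has axes 0 and 1, so axis=2 is rejected.
try:
    np.concatenate([x, embed], axis=2)
    axis_2_failed = False
except ValueError as err:
    axis_2_failed = True
    print("axis=2 fails:", err)
```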

Select weight of action from a tensorflow model

I have a small model used in a reinforcement learning context.
I can input a 2d tensor of states, and I get a 2d tensor of action weights.
Let's say I input two states and get the following action weights out:
[[0.1, 0.2],
[0.3, 0.4]]
Now I have another 2d tensor which holds the action number whose weight I want for each state:
[[1],
[0]]
How can I use this tensor to get the weight of actions?
In this example I'd like to get:
[[0.2],
[0.3]]
This is similar to Tensorflow tf.gather with the axis parameter, but the indices are handled a little differently here:
a = tf.constant( [[0.1, 0.2], [0.3, 0.4]])
indices = tf.constant([[1],[0]])
# convert to full indices
full_indices = tf.stack([tf.range(indices.shape[0])[...,tf.newaxis], indices], axis=2)
# gather
result = tf.gather_nd(a,full_indices)
with tf.Session() as sess:
    print(sess.run(result))
# [[0.2]
#  [0.3]]
A simple way to do this is to squeeze the dimensions of indices, multiply the weights element-wise with the corresponding one-hot vectors, sum over the action axis, and then expand the dimensions again.
import tensorflow as tf
weights = tf.constant([[0.1, 0.2], [0.3, 0.4]])
indices = tf.constant([[1], [0]])
# Reduce from 2d (2, 1) to 1d (2,); axis=1 keeps the batch dimension even for a batch of one
indices1d = tf.squeeze(indices, axis=1)
# One-hot vector corresponding to the indices. shape (2, 2)
action_one_hot = tf.one_hot(indices=indices1d, depth=weights.shape[1])
# Element-wise multiplication and sum across axis 1 to pick the weight. Shape (2,)
action_taken_weight = tf.reduce_sum(action_one_hot * weights, axis=1)
# Expand the dimension back to have a 2d. Shape (2, 1)
action_taken_weight2d = tf.expand_dims(action_taken_weight, axis=1)
sess = tf.InteractiveSession()
print("weights\n", sess.run(weights))
print("indices\n", sess.run(indices))
print("indices1d\n", sess.run(indices1d))
print("action_one_hot\n", sess.run(action_one_hot))
print("action_taken_weight\n", sess.run(action_taken_weight))
print("action_taken_weight2d\n", sess.run(action_taken_weight2d))
Should give you the following output:
weights
[[0.1 0.2]
[0.3 0.4]]
indices
[[1]
[0]]
indices1d
[1 0]
action_one_hot
[[0. 1.]
[1. 0.]]
action_taken_weight
[0.2 0.3]
action_taken_weight2d
[[0.2]
[0.3]]
Note: You can also do action_taken_weight = tf.reshape(action_taken_weight, tf.shape(indices)) instead of expand_dims.
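To sanity-check the expected result outside the graph, the same selection can be sketched in NumPy. This is a NumPy counterpart for comparison, not a TensorFlow API; the variable names picked and picked_fancy are illustrative.

```python
import numpy as np

weights = np.array([[0.1, 0.2], [0.3, 0.4]])
indices = np.array([[1], [0]])

# For each row i, pick weights[i, indices[i, 0]]; the result keeps shape (2, 1).
picked = np.take_along_axis(weights, indices, axis=1)

# Equivalent "full index" form: pair each row number with its column index.
rows = np.arange(weights.shape[0])
picked_fancy = weights[rows, indices[:, 0]][:, np.newaxis]

print(picked)
```

The second form mirrors the gather_nd answer: it builds explicit (row, column) pairs before indexing.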

Tensorflow dynamic_rnn propagates nans for batch size greater than 1

Hoping someone can help me understand an issue I have been having using LSTMs with dynamic_rnn in Tensorflow. As per this MWE, when I have a batch size of 1 with sequences that are incomplete (I pad the short tensors with nan's as opposed to zeros to highlight) everything operates as normal, the nan's in the short sequences are ignored as expected...
import tensorflow as tf
import numpy as np
batch_1 = np.random.randn(1, 10, 8)
batch_2 = np.random.randn(1, 10, 8)
batch_1[6:] = np.nan # lets make a short batch in batch 1 second sample of length 6 by padding with nans
seq_lengths_batch_1 = [6]
seq_lengths_batch_2 = [10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[1, 10, 8], dtype=tf.float32)
lengths = tf.placeholder(shape=[1], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=5)
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(fake_loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
print(fl, lrv)
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_2, lengths: seq_lengths_batch_2})
print(fl, lrv)
sess.close()
which outputs properly populated values of the ilk....
0.00659429 [[ 0.11608966 0.08498846 -0.02892204 -0.01945034 -0.1197343 ]]
-0.080244 [[-0.03018401 -0.18946587 -0.19128899 -0.10388547 0.11360413]]
However, when I increase the batch size to 3, for example, the first batch executes correctly but then somehow the second batch causes nans to start propagating:
import tensorflow as tf
import numpy as np
batch_1 = np.random.randn(3, 10, 8)
batch_2 = np.random.randn(3, 10, 8)
batch_1[1, 6:] = np.nan
batch_2[0, 8:] = np.nan
seq_lengths_batch_1 = [10, 6, 10]
seq_lengths_batch_2 = [8, 10, 10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[3, 10, 8], dtype=tf.float32)
lengths = tf.placeholder(shape=[3], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=5)
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(fake_loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
print(fl, lrv)
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_2, lengths: seq_lengths_batch_2})
print(fl, lrv)
sess.close()
giving
0.0533635 [[ 0.33622459 -0.0284576 0.11914439 0.14402215 -0.20783389]
[ 0.20805927 0.17591488 -0.24977767 -0.03432769 0.2944448 ]
[-0.04508523 0.11878576 0.07287208 0.14114542 -0.24467923]]
nan [[ nan nan nan nan nan]
[ nan nan nan nan nan]
[ nan nan nan nan nan]]
I find this behavior quite strange, as I expected all values beyond the sequence lengths to be ignored, as happens with a batch size of 1, but this doesn't work with a batch size of 2 or more.
Obviously, nans do not get propagated if I use 0 as my padding value, but this doesn't give me any confidence that dynamic_rnn is functioning as I expect it to.
I should also mention that if I remove the optimisation step the issue doesn't occur, so now I'm properly confused. After a day of trying many different permutations, I can't see why batch size would make any difference here.
I did not trace it down to the exact operation but here is what I believe to be the case.
Why aren't values beyond sequence_length ignored? They are ignored in the sense that they are multiplied by 0 (they are masked out) when doing some operations. Mathematically, the result is always a zero, so they should have no effect. Unfortunately, nan * 0 = nan. So, if you give nan values in your examples, they propagate. You might wonder why TensorFlow does not ignore them completely, but only masks them. The reason is performance on modern hardware. It is much easier to do operations on a large regular shape with a bunch of zeros than on several small shapes (that you get from decomposing an irregular shape).
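The arithmetic behind this can be illustrated with a small NumPy sketch. It only shows the difference between masking by multiplication and masking by selection; it does not claim to reproduce what dynamic_rnn does internally, and the arrays values and valid are made up for the example.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, np.nan])  # last two entries are padding
valid = np.array([1.0, 1.0, 0.0, 0.0])         # 1 inside the sequence, 0 beyond it

# Masking by multiplication: nan * 0 is still nan, so the padding leaks through.
masked_by_multiply = values * valid

# Masking by selection: the padded entries are replaced, never multiplied.
masked_by_select = np.where(valid > 0, values, 0.0)

print(masked_by_multiply.sum())
print(masked_by_select.sum())
```

With any finite padding value both variants give the same sum, which is why the problem only shows up when the padding is nan.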
Why does it only happen on the second batch? In the first batch, the loss and last hidden state are computed using the original variable values. They are fine. Because you also do the optimizer update in the sess.run(), variables get updated and become nan in the first call. In the second call, the nans from variables spread to loss and hidden state.
How can I be confident that the values beyond sequence_length are really masked out? I modified your example to reproduce the issue but also made it deterministic.
import tensorflow as tf
import numpy as np
batch_1 = np.ones((3, 10, 2))
batch_1[1, 7:] = np.nan
seq_lengths_batch_1 = [10, 7, 10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[3, 10, 2], dtype=tf.float32)
lengths = tf.placeholder(shape=[3], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=3, initializer=tf.constant_initializer(1.0))
init_state = tf.nn.rnn_cell.LSTMStateTuple(*[tf.ones([3, c]) for c in cell.state_size])
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals,
                                    initial_state=init_state)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(fake_loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1):
        _, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value],
                              feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
    print("VARIABLES:", sess.run(tf.trainable_variables()))
    print("LOSS and LAST HIDDEN:", fl, lrv)
If you replace the np.nan in batch_1[1, 7:] = np.nan with any number (e.g. try -1M, 1M, 0), you will see that the values you get are the same. You can also run the loop for more iterations. As a further sanity check, if you set seq_lengths_batch_1 to something "wrong", e.g. [10, 8, 10], you can see that the value you use in batch_1[1, 7:] = np.nan now affects the output.

Basic Tensorflow: Define Tensor variable using existing variables

I have some very simple tensorflow code to rotate a vector:
import tensorflow as tf
import numpy as np
x = tf.placeholder(tf.float32, shape=(2, 1))
angle = tf.placeholder(tf.float32)
s_a = tf.sin(angle)
c_a = tf.cos(angle)
R = tf.Variable([[c_a, s_a], [-s_a, c_a]], tf.float32, expected_shape=(2,2))
#R = tf.Variable([[1.0, 0.0], [0.0, 1.0]], tf.float32)
rotated_v = tf.matmul(R,x)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    res = sess.run([init,rotated_v], feed_dict={x:np.array([[1.0],[1.0]]), angle:1.0})
    print(res)
The code works fine when I hand-code the identity matrix. However, in its current form I get this error:
ValueError: initial_value must have a shape specified: Tensor("Variable/initial_value:0", dtype=float32)
I've tried specifying the shape in multiple ways, but I can't make this work.
What am I doing wrong?
I have figured out a way to achieve this (might not be the best way, but it works).
import tensorflow as tf
import numpy as np
x = tf.placeholder(tf.float32, shape=(2, 1))
angle = tf.placeholder(tf.float32)
s_a = tf.sin(angle)
c_a = tf.cos(angle)
R = tf.Variable([[1.0, 0.0], [0.0, 1.0]], tf.float32)
assignR = tf.assign(R, [[c_a, s_a], [-s_a, c_a]])
rotated_v = tf.matmul(R,x)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    newR = sess.run(assignR, feed_dict={angle:1.0})
    print(newR)
    print()
    res = sess.run([rotated_v], feed_dict={x:np.array([[1.0],[1.0]])})
    print(res)
The original approach won't work because s_a and c_a are op outputs whose values are uniquely determined by angle. You can't assign or update these nodes, so training them doesn't make any sense.
This line, on the other hand...
R = tf.Variable([[1.0, 0.0], [0.0, 1.0]], tf.float32)
... is a definition of an independent variable with an initial value equal to the identity matrix. This is perfectly valid. Since this variable is independent, you can assign it a new value built from s_a and c_a. Note that you can't initialize it with s_a and c_a, because the initializer is run before the values are fed into a session (so angle is unknown).
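Since the rotation matrix is fully determined by angle, it can also be treated as a plain function of angle with no stored state. The following is a NumPy sketch of that idea, not TensorFlow code; rotation_matrix is a hypothetical helper, and the sign convention is copied from the question.

```python
import numpy as np

def rotation_matrix(angle):
    # Same layout as in the question: [[cos, sin], [-sin, cos]]
    s_a, c_a = np.sin(angle), np.cos(angle)
    return np.array([[c_a, s_a], [-s_a, c_a]])

x = np.array([[1.0], [1.0]])
rotated_v = rotation_matrix(1.0) @ x
print(rotated_v)
```

A variable is only needed if the matrix has to be stored or updated independently of angle.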