How to compute a dot product with negative samples - CNTK

CNTK currently provides a function to compute cosine distance with negative samples. I am wondering how one could do a simple dot product with negative sampling in CNTK.

Assuming query is the thing you want to match against candidates, normalizing across all candidates in the batch, you can use something like this:
import cntk as C

def all_pairs_loss(query, candidates):
    # Unpack the batch axis so every query can see every candidate.
    qry_matrix = C.unpack_batch(query)
    cnd_matrix = C.unpack_batch(candidates)
    # Inner products of each query against all candidates in the batch.
    all_inner_products = C.to_batch(C.times_transpose(cnd_matrix, qry_matrix))
    # Inner product of each query with its own (positive) candidate.
    positive_inner_products = C.reduce_sum(query * candidates, axis=0)
    # Softmax cross-entropy over the batch: log-sum-exp of all pairs minus the positive pair.
    loss = C.reduce_log_sum_exp(all_inner_products) - positive_inner_products
    return loss
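
A minimal usage sketch (my own addition, not from the original answer; the shapes and data are made up for illustration, and it assumes the all_pairs_loss definition above):

import numpy as np

EMB_DIM = 4
query = C.input_variable(EMB_DIM)
candidates = C.input_variable(EMB_DIM)
loss = all_pairs_loss(query, candidates)

# A batch of 3 query/candidate pairs; each query's positive candidate is its
# own row, and the other rows act as in-batch negatives.
q_data = np.random.randn(3, EMB_DIM).astype(np.float32)
c_data = np.random.randn(3, EMB_DIM).astype(np.float32)
print(loss.eval({query: q_data, candidates: c_data}))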

Related

How to add after each iteration in tensorflow

I am trying to achieve the following: compute the losses over the previous 25 predictions and sum them before computing the gradient. I have tried this:
loss_summation = tf.Variable(0, dtype=tf.dtypes.float32, name="loss")
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=next_element[1], logits=logits2, name="xentropy")
loss = tf.math.reduce_sum(tf.reduce_mean(xentropy, name="loss"))
loss_summation = tf.assign(loss_summation, loss_summation + loss)
optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
gvs = optimizer.compute_gradients(loss_summation, [vars])

with tf.Session() as sess:
    for i in range(25):
        b = sess.run([loss_summation])
However, optimizer.compute_gradients() complains that None values are not supported. How can I get around this?
I am actually trying to implement the following function (the feedforward pass of an LSTM) in TensorFlow, to predict the next word given the previous ones:
def feedforward(self, x_s, hpre, targets, p_s):
    fts, its, gts, css, ots, output, inputs = [], [], [], [], [], [], []
    losses = []
    hprev = hpre
    hts = [hprev]
    loss = 0
    previous_state = p_s
    css.append(previous_state)
    for x, y in zip(x_s, targets):
        k = np.zeros((self.vocab_size, 1))
        k[x] = 1
        M_c = np.row_stack((hprev, k))
        ft = self.sigmoid(np.dot(self.W1, M_c) + self.b1)
        fts.append(ft)
        it = self.sigmoid(np.dot(self.W2, M_c) + self.b2)
        its.append(it)
        gt = np.tanh(np.dot(self.W3, M_c) + self.b3)
        gts.append(gt)
        cs = (ft * previous_state) + (it * gt)
        previous_state = cs
        css.append(cs)
        ot = self.sigmoid(np.dot(self.W4, M_c) + self.b4)
        ots.append(ot)
        ht = ot * np.tanh(cs)
        hts.append(ht)
        yt = self.softmax(np.dot(self.W5, ht) + self.b5)
        hprev = ht
        output.append(yt)
        inputs.append(M_c)
        loss += -np.log(yt[y])
        losses.append(loss)
    return fts, its, gts, css, ots, output, hts, loss, hts[-1], css[-1], inputs
x_s is a list of integers representing words.
x_s=[0,1,2,3,4,5,6,7,8....,24]
targets is the list of expected integers, i.e. if the current word is 0 then the next word is 1:
targets=[1,2,3,4,5,6,7,8,9...,25]
The loss, which is a summation of 25 losses, is what will be minimized.
There are a few things you need to address here:
Is there a good reason not to just use larger batches? Are you trying to implement the lookahead optimizer or something?
It looks like you're getting started with TensorFlow. Consider turning on eager execution with tf.enable_eager_execution(). TensorFlow 2.0 is coming soon; don't waste your time messing with tf.Sessions.
Variables are not differentiable, so accumulating the losses in a variable doesn't make any sense.
I would make a copy of all the model's variables and accumulate new values there. Then, after N iterations, assign those values back to the model. Something like:
model = tf.keras.Sequential(...)
vars = model.trainable_variables
weight_acc = [tf.Variable(var) for var in model.trainable_variables]

for n, (batch, label) in enumerate(dataset):
    with tf.GradientTape() as tape:
        pred = model(batch)
        loss = cal_loss(pred, label)
    grads = tape.gradient(loss, vars)
    for g, a in zip(grads, weight_acc):
        # Take a gradient-descent step on the accumulator copy.
        a.assign_sub(learning_rate * g)
    if n % 25 == 0:
        # Every 25 steps, move the real weights toward the accumulated ones.
        for a, v in zip(weight_acc, vars):
            v.assign_add(lookahead_fraction * (a - v))
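
An alternative that matches the question more literally is to accumulate gradients rather than weights, and apply them once every 25 steps. This is a rough sketch under the same assumptions as above (model, dataset, and cal_loss are placeholders, not defined here):

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grad_acc = [tf.Variable(tf.zeros_like(v)) for v in model.trainable_variables]

for n, (batch, label) in enumerate(dataset):
    with tf.GradientTape() as tape:
        pred = model(batch)
        loss = cal_loss(pred, label)
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(grad_acc, grads):
        acc.assign_add(g)
    if (n + 1) % 25 == 0:
        # Apply the summed gradients once, then reset the accumulators.
        optimizer.apply_gradients(zip(grad_acc, model.trainable_variables))
        for acc in grad_acc:
            acc.assign(tf.zeros_like(acc))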

Distance between words in tensorflow embedding

I'd like to use one of the models on TensorFlow Hub to look at the distances between words (specifically this one https://tfhub.dev/google/nnlm-en-dim128/1). But I can't find a good example of how to find the distance between two words or two groups of words... is this something that is possible with an embedding like this?
I'm 100% not a data scientist, so this might reflect a complete lack of understanding; apologies if it's a dumb question.
Ideally I'd like to look at the distance of a single word compared to two different sets of words.
I think the most common measure of distance between two embedded vectors is the cosine similarity.
We can calculate the cosine similarity of vectors a and b using the formula:

cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖)

which we can translate into TensorFlow code as follows:
def cosine_similarity(a, b):
    mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
    mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
    return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)
so we have a complete example as follows:
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embeddings = embed(["cat is on the mat", "tiger sat on the mat"])

def cosine_similarity(a, b):
    mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
    mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
    return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)

a = embeddings[0]
b = embeddings[1]
cos_similarity = cosine_similarity(a, b)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(tf.global_variables_initializer())
    print(sess.run(cos_similarity))
which outputs 0.78157.
Note that some folks advocate a rearrangement of the formula which gives the same results (+/- minuscule rounding errors) and may or may not be slightly better optimised.
This alternative formula is calculated as:
def cosine_similarity(a, b):
    norm_a = tf.nn.l2_normalize(a, 0)
    norm_b = tf.nn.l2_normalize(b, 0)
    return tf.reduce_sum(tf.multiply(norm_a, norm_b))
Personally, I can't see how the difference could be anything other than negligible, and I happen to know the first formulation, so I tend to stick with it; but I certainly make no claim that it's best, and I don't claim to know which is fastest! :-)
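
To the original ask of comparing a single word against two different sets of words, one rough approach (my own sketch, not part of the answer above; the example words are made up) is to average each set's embeddings and compare the single word's vector against each mean:

word = embed(["dog"])[0]
set_a = tf.reduce_mean(embed(["cat", "tiger", "lion"]), axis=0)
set_b = tf.reduce_mean(embed(["car", "train", "bus"]), axis=0)

sim_a = cosine_similarity(word, set_a)
sim_b = cosine_similarity(word, set_b)
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(tf.global_variables_initializer())
    print(sess.run([sim_a, sim_b]))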

Normalized Mutual Information in Tensorflow

Is it possible to implement normalized mutual information in TensorFlow? I was wondering if I can do that, and whether I will be able to differentiate it. Let's say I have predictions P and labels Y in two different tensors. Is there an easy way to use normalized mutual information?
I want to do something similar to this:
https://course.ccs.neu.edu/cs6140sp15/7_locality_cluster/Assignment-6/NMI.pdf
Assume your clustering method gives probability predictions/membership functions p(c|x), e.g., p(c=1|x) is the probability of x being in the first cluster. Assume y is the ground-truth class label for x.
The normalized mutual information is NMI(Y; C) = 2 I(Y; C) / (H(Y) + H(C)).
The entropy H(Y) can be estimated following this thread: https://stats.stackexchange.com/questions/338719/calculating-clusters-entropy-python
By definition, the entropy H(C) is H(C) = -Σ_c p(c) log p(c), where p(c) = ∫ p(c|x) p(x) dx.
The mutual information is I(Y; C) = H(C) - H(C|Y), where the conditional entropy H(C|Y) = -Σ_y p(y) Σ_c p(c|y) log p(c|y), and p(c|y) = ∫ p(c|x) p(x|y) dx.
All terms involving an integral can be estimated by sampling, i.e., averaging over training samples. The overall NMI is differentiable.
I hope I did not misunderstand your question. Since you did not provide any info, I assumed you used a neural network model which outputs logits; in that case you need to normalise the logits (e.g. with a softmax) to get p(c|x).
There may be other ways to estimate NMI, but if you discretize the output of whatever model you use, you cannot differentiate them.
TensorFlow code
Assume we have a label matrix p_y_on_x and cluster predictions p_c_on_x. Each row corresponds to an observation x; each column corresponds to the probability of x being in each class or cluster (so each row sums to one). Further assume a uniform probability for p(x) and p(x|y).
Then NMI can then be estimated as below:
p_y = tf.reduce_sum(p_y_on_x, axis=0, keepdims=True) / num_x  # 1-by-num_y
h_y = -tf.reduce_sum(p_y * tf.math.log(p_y))
p_c = tf.reduce_sum(p_c_on_x, axis=0, keepdims=True) / num_x  # 1-by-num_c
h_c = -tf.reduce_sum(p_c * tf.math.log(p_c))
p_x_on_y = p_y_on_x / num_x / p_y  # num_x-by-num_y
p_c_on_y = tf.matmul(p_c_on_x, p_x_on_y, transpose_a=True)  # num_c-by-num_y
h_c_on_y = -tf.reduce_sum(tf.reduce_sum(p_c_on_y * tf.math.log(p_c_on_y), axis=0) * p_y)
i_y_c = h_c - h_c_on_y
nmi = 2 * i_y_c / (h_y + h_c)
In practice, please be very careful with the probabilities: they should be strictly positive to avoid infinities from tf.math.log.
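One simple guard (my own addition, not part of the original answer) is a clipped log helper used in place of tf.math.log above:

def safe_log(p, eps=1e-8):
    # Clip probabilities away from zero so log never returns -inf;
    # eps is an arbitrary small constant.
    return tf.math.log(tf.clip_by_value(p, eps, 1.0))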
Please comment if you find any mistakes.

Keras custom loss function with binary (round) with tensorflow backend

I'm currently trying to implement a custom loss function (precision) with a binary outcome, but the TensorFlow backend refuses to use the round function, which is needed in order to generate a '0' or '1'.
As far as I have investigated, this is because TensorFlow defines the gradient of round as None, and the loss function can't return None.
I have currently implemented this custom loss, which generates values as close as possible to '0' or '1', using the R Keras interface.
precision_loss <- function(y_true, y_pred) {
  y_pred_pos = K$clip(y_pred, 0, 1)
  # Custom sigmoid to generate '0' '1'
  y_pred_pos = K$maximum(0, K$minimum(1, (y_pred_pos + 0.0625) / 0.125))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$clip(y_true, 0, 1)
  # Custom sigmoid to generate '0' '1'
  y_pos = K$maximum(0, K$minimum(1, (y_pos + 0.0625) / 0.125))
  y_neg = 1 - y_pos
  # Generate confusion matrix counts
  tp = K$sum(y_pos * y_pred_pos)
  tn = K$sum(y_neg * y_pred_neg)
  fp = K$sum(y_neg * y_pred_pos)
  fn = K$sum(y_pos * y_pred_neg)
  return(1 - (tp / (tp + fp + K$epsilon())))
}
Notice the "sigmoid" : K$maximum(0,K$minimum(1,(y_pos+0.0625)/0.125))
What I wanted to implement is a workaround for this one:
precision_loss <- function(y_true, y_pred) {
  y_pred_pos = K$round(K$clip(y_pred, 0, 1))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$round(K$clip(y_true, 0, 1))
  y_neg = 1 - y_pos
  # Generate confusion matrix counts
  tp = K$sum(K$clip(y_pos * y_pred_pos, 0, 1))
  tn = K$sum(K$clip(y_neg * y_pred_neg, 0, 1))
  fp = K$sum(K$clip(y_neg * y_pred_pos, 0, 1))
  fn = K$sum(K$clip(y_pos * y_pred_neg, 0, 1))
  return(1 - (tp / (tp + fp + K$epsilon())))
}
Does anyone have an alternative implementation that avoids using round to generate binary outcomes in the loss function?
PS: in custom metric functions, round is allowed.
In order to build a binary loss function, it wouldn't be enough to just build the custom loss function itself. You would also have to pre-define the gradients.
Your high-dimensional loss function would be zero for some points and one for all others. At all points of discontinuity in this space, it is impossible to compute a gradient analytically (the concept of a gradient doesn't even exist there), so you would have to define one yourself. And at all points where the loss is continuous (e.g. an open set in which all loss values are 1), the gradient exists but is zero, so you would also have to pre-define the gradient values, otherwise your weights wouldn't move at all.
That means either way you would have to define your own custom "gradient" computation function that replaces Keras' (i.e. TensorFlow's) automatic differentiation engine for that particular node in the graph (the loss function node).
You could certainly achieve this by modifying your local copy of Keras or TensorFlow, but nothing good can come from it.
Also, even if you managed to do this, consider this: If your loss function returns only 0 or 1, that means it can only distinguish between two states: The model's prediction is either 100% correct (0 loss) or it is not 100% correct (1 loss). The magnitude of the gradient would have to be the same for all non-100% cases. Is that a desirable property?
Your quasi-binary sigmoid solution has the same problem: The gradient will be almost zero almost everywhere, and in the few points where it won't be almost zero, it will be almost infinity. If you try to train a model with that loss function, it won't learn anything.
As you noticed, a custom loss function needs to be based on functions which have their gradients defined (in order to minimise the loss function), which is not necessary for a simple metric. Some functions like "round" and "sign" are difficult to use in a loss function since their gradients are either null everywhere or infinite, which is not helpful for minimisation. That's probably why their gradients are not defined by default.
Then, you have two options:
Option 1: you use the round function, but you need to add your own custom gradient for round to substitute for it in the backend (a sketch of this appears after this list).
Option 2: you define another loss function without using round.
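For completeness, option 1 can be sketched in Python with tf.custom_gradient as a straight-through estimator (my own illustration, not part of the original answer; the question itself is in R, where the same idea would need a custom op):

import tensorflow as tf

@tf.custom_gradient
def round_straight_through(x):
    def grad(dy):
        # Backward pass pretends round is the identity function.
        return dy
    return tf.round(x), grad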
You chose option 2, which I think is the best option. But your "sigmoid" is very linear, so it is probably not a good approximation of your "round" function. You could use an actual sigmoid, which is slower due to the exponential, or you could obtain a similar result with a modified softsign:
max_gradient <- 100
# Centering on (y_pos - 0.5) puts the sharp transition at 0.5.
K$maximum(0, K$minimum(1, 0.5 * (1 + (max_gradient * (y_pos - 0.5)) / (1 + max_gradient * abs(y_pos - 0.5)))))
The max_gradient coefficient can be used to make the edge around 0.5 sharper; it defines the maximum gradient, reached at 0.5.
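
For intuition about the sharpness, here is a quick numeric check of the modified softsign in Python (my own sketch; it assumes the edge is centered at 0.5 and uses max_gradient = 100):

import numpy as np

def soft_round(y, k=100):
    # Modified softsign centered at 0.5; larger k makes the edge sharper.
    s = k * (y - 0.5) / (1 + k * np.abs(y - 0.5))
    return np.maximum(0, np.minimum(1, 0.5 * (1 + s)))

for y in [0.0, 0.4, 0.5, 0.6, 1.0]:
    print(y, soft_round(y))  # approaches round(y) as k grows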

How to shift a tensor using an API in TensorFlow, just like numpy.roll() or shift? [duplicate]

Let's say that we want to process images (or n-dim vectors) using Keras/TensorFlow.
And we want, for fancy regularization, to shift each input by a random number of positions to the left (with the overflowed portion reappearing on the right side).
How could this be viewed and solved:
1) Is there any variation of the numpy roll function for TensorFlow?
2) Something like the following, where x is a 2D tensor and ri is a random integer:

concatenate(x[:, ri:], x[:, 0:ri], axis=1)  # executed for each single input to the layer, ri being random again and again (I can live with random only for each batch)
In TensorFlow v1.15.0 and up, you can use tf.roll, which works just like numpy roll: https://github.com/tensorflow/tensorflow/pull/14953
To improve on the answer above you can do:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.roll(tensor, shift=i, axis=[1])
For older versions, starting from v1.6.0, you will have to use tf.manip.roll:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.manip.roll(tensor, shift=i, axis=[1])
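
If you wanted a different shift for every example rather than one shift per batch (the question accepts per-batch randomness, so this is optional), one possibility is tf.map_fn; this is a sketch I have not benchmarked, and map_fn can be slow:

def roll_one(example):
    # Draw a fresh random shift for each example in the batch.
    s = tf.random_uniform(shape=[], maxval=x_len, dtype=tf.int32)
    return tf.roll(example, shift=s, axis=0)  # axis 0 of a single example

output = tf.map_fn(roll_one, tensor)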
I just had to do this myself, and I don't think there is a TensorFlow op to do np.roll, unfortunately. Your code above looks basically correct, though, except that it doesn't roll by ri, but rather by (x.shape[1] - ri).
Also, you need to be careful when choosing your random integer: it should be from range(1, x.shape[1]+1) rather than range(0, x.shape[1]), since if ri were 0, then x[:, 0:ri] would be empty.
So what I would suggest would be something more like the following (for rolling along dimension 1):

x_len = x.get_shape().as_list()[1]
i = np.random.randint(0, x_len)  # the amount you want to roll by
y = tf.concat([x[:, x_len - i:], x[:, :x_len - i]], axis=1)
EDIT: added missing colon after hannes' correct comment.