How to select entries based on index returned by tf.argmin in Tensorflow

Given two matrices of the same shape, Loss and Weights.
I need to return the result of tf.reduce_min(Loss, axis=0) and also the corresponding weights from the Weights matrix (selected at the same indices where tf.reduce_min selected its results).
I can use tf.argmin(Loss, 0) to find the indices of the minimal values. How do I use these indices to get the corresponding values from the Weights matrix? I think it is possible to implement this with tf.gather, but then the results won't be differentiable. Are there any known solutions to this?

Gather is differentiable with respect to the values it returns. It won't be differentiable with respect to the indices, because that is mathematically impossible.
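For the per-column selection described in the question, one option is tf.gather_nd driven by the tf.argmin indices. Below is a minimal sketch with small made-up matrices; gradients flow into the gathered weight values, just not through the index computation itself.
import tensorflow as tf

# Toy matrices standing in for Loss and Weights (values are made up).
Loss = tf.constant([[3.0, 1.0], [2.0, 5.0]])
Weights = tf.Variable([[10.0, 20.0], [30.0, 40.0]])

min_loss = tf.reduce_min(Loss, axis=0)                    # per-column minima of Loss
idx = tf.argmin(Loss, axis=0)                             # row index of each column's minimum
cols = tf.cast(tf.range(tf.shape(Loss)[1]), idx.dtype)    # column indices 0..C-1
min_weights = tf.gather_nd(Weights, tf.stack([idx, cols], axis=1))  # Weights at the same positions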

Related

Is it possible to get the sample indexes in Keras custom loss function?

In my Keras custom loss function I would like to know the sample indices (as in the original input array) for the current y_true and y_pred tensors.
I know it sounds strange, but to calculate the loss I need some additional information, which I prepare in an external array that is part of neither the input array nor the expected output array.
The only solution I currently see is to include it in the expected output array as additional columns, so that I get it in y_true, but I am not sure how disruptive it would be for the network and the optimizer to have an extra node in the output layer whose actual prediction is not correlated with the calculated loss...
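For what it's worth, the workaround described above (extra columns packed into y_true) can often be done without adding any output nodes, by splitting y_true inside the loss. The names, shapes and the way the extra column is used below are assumptions for illustration, and whether mismatched widths between y_true and the model output are accepted may depend on the Keras version:
import tensorflow as tf

# Assumed setup: the real labels have n_targets columns, and one extra column of
# per-sample side information is appended to them before calling model.fit.
def make_loss(n_targets):
    def loss_fn(y_true, y_pred):
        targets = y_true[:, :n_targets]        # the actual labels
        extra = y_true[:, n_targets:]          # the per-sample side information
        per_sample = tf.reduce_mean(tf.square(targets - y_pred), axis=-1)
        return per_sample * tf.squeeze(extra, axis=-1)   # e.g. use it as a per-sample weight
    return loss_fn

# y_true passed to fit() would then be np.hstack([labels, extra_info_column]).
Because the model output still has only n_targets units, the network is never forced to predict the extra column.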

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be done directly
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(your_embedding_matrix, max_indices)
    # Do anything you want with a normal Keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
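You can then pass this function directly to Keras, e.g. model.compile(optimizer='adam', loss=custom_loss), since its (target, softmax) arguments line up with the (y_true, y_pred) signature Keras expects.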
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Note that such operations are acceptable for the ground-truth value (a loss function takes a ground-truth value and a predicted value; non-differentiable operations are only fine for the ground-truth value).
To be fair, that was what I was asking in the first place. It is not possible to do exactly what I wanted, but we can get a similar and differentiable behaviour:
1) Take an element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the closer to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but our "soft_extreme" is not. First, find the sum of each row (keepdims=True keeps the shape compatible with the division in the next step):
norm = tf.reduce_sum(soft_extreme, 1, keepdims=True)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. Lookup is forbidden, but we can take a weighted average vector using matrix multiplication. Because our almost_one_hot is close to a one-hot encoding, this average will be close to the vector associated with the highest argument (the original intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
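Putting the four steps together for batched input, here is a minimal sketch (the shapes and the value of my_power are assumptions). Note that a plain tf.matmul of a [batch, dictionary_length] tensor with a [dictionary_length, embedding_dim] matrix already handles the batch dimension, which may let you skip the reshape-and-tile step entirely:
import tensorflow as tf

my_power = 8   # assumed value; higher = closer to one-hot, but risks NaNs

def soft_one_hot_embedding(softmax, embedding_matrix):
    # softmax: [batch, dictionary_length], embedding_matrix: [dictionary_length, embedding_dim]
    soft_extreme = softmax ** my_power                              # step 1: sharpen
    norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)       # step 2: row sums
    almost_one_hot = soft_extreme / norm                            # step 3: normalize
    return tf.matmul(almost_one_hot, embedding_matrix)              # step 4: weighted-average embedding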

BroadcastGradientArgs no documentation provided

I am creating my own custom ops. While inspecting the ops in the backward pass, I came across BroadcastGradientArgs.
Does anyone have any idea what this op does?
It is an internal op that returns the reduction axes given two tensor shapes. Notice that its return values are always fed into reduce_sum. Ops that support broadcasting (an op involving a tensor of lesser rank or shape) need such a reduction so that the resulting gradient has the same size as the original tensor. It has the effect of summing the individual gradients over the broadcast axes into one value per original element.
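As an illustration of the effect (not of the internal op itself), here is a small tf.GradientTape example with made-up values: the gradient of a broadcast operand is summed over the axes along which it was broadcast.
import tensorflow as tf

a = tf.Variable([[1.0, 2.0, 3.0]])          # shape [1, 3], broadcast along axis 0
b = tf.Variable([[1.0, 1.0, 1.0],
                 [2.0, 2.0, 2.0]])          # shape [2, 3]

with tf.GradientTape() as tape:
    y = tf.reduce_sum(a * b)                # a * b broadcasts a to [2, 3]

grad_a = tape.gradient(y, a)                # shape [1, 3]: per-element gradients summed over axis 0
print(grad_a)                               # [[3. 3. 3.]]  (1 + 2 in every column)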

Keras model returns different values

To play with data, I have trained a linear regression with Keras+TensorFlow, and compared the first prediction computed in 3 different ways:
1) I got the weights from the model and just used the linear regression formula p = w*X0 + b
2) I got predictions using the model.predict(X) method of Keras for the whole data array X and then took only the first element of it
3) I got a prediction using the same method, but only for the first row of features X0 (the first sample)
In theory, all those methods should produce the very same value. However, in practice I do get values that are a bit different.
This difference is not that big, but I still wonder why that is the case. Is it only due to floating-point precision in Python?
This is most likely due to the fact that matrix multiplications and convolutions are implemented in a way that is non-deterministic: if you change the batch size, you change the order in which the multiply-adds happen, and since floating-point addition is not associative you get slightly different results.
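A quick way to see that the order of floating-point additions matters (made-up numbers, independent of Keras):
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is absorbed when added to -1e8 first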

Optimizing a subset of a tensor in Tensor Flow

I have a free variable (tf.Variable) x, and I wish to minimize an error term with respect to a subset of the tensor x (for example, minimizing the error only with respect to the first row of a 2D tensor).
One way is to compute the gradients, zero out the gradient for the irrelevant parts of the tensor, and then apply the gradients. Is there another way?
You can use a mask and tf.stop_gradient to selectively block gradients for part of the variable: use mask * x + tf.stop_gradient((1 - mask) * x) in place of x. A value of 1 in the mask denotes the parts that should receive gradients and 0 the parts that should not; the forward value stays identical to x.
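A minimal sketch of that trick with tf.GradientTape (the shapes and the mask pattern are assumptions): the gradient comes out zero everywhere the mask is zero, while the forward value of x is unaffected.
import tensorflow as tf

x = tf.Variable(tf.random.normal([3, 4]))
mask = tf.constant([[1.0], [0.0], [0.0]]) * tf.ones([1, 4])     # train only the first row

with tf.GradientTape() as tape:
    x_used = mask * x + tf.stop_gradient((1.0 - mask) * x)      # same forward value as x
    loss = tf.reduce_sum(tf.square(x_used))

grad = tape.gradient(loss, x)    # rows 2 and 3 of grad are all zeros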