What are tf.IndexedSlices and dense gradients? Why does a Keras optimizer distinguish between them?

I want to know the difference between tf.IndexedSlices and dense gradients. Why does a Keras optimizer distinguish between them when updating the variables at each step?
All existing optimizers in Keras (Adam, SGD, Nadam, ...) distinguish between tf.IndexedSlices gradients and dense gradients when updating variables (in the update_step() function):
if isinstance(gradient, tf.IndexedSlices):
    update manner 1
else:  # Dense gradients.
    update manner 2 (a different update manner)
I want to write a custom optimizer, and I would like to know why I should distinguish between tf.IndexedSlices gradients and dense gradients.
Thank you.
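For context, a gradient arrives as a tf.IndexedSlices when only a few rows of a variable are touched in the forward pass, typically via tf.gather or an embedding lookup. It stores just the touched rows (values) and their row numbers (indices), so the optimizer can scatter-update those rows instead of writing the full dense tensor. A minimal sketch (the update code is an illustrative simplification, not the actual Keras update_step):
import tensorflow as tf

emb = tf.Variable(tf.random.normal([1000, 16]))

with tf.GradientTape() as tape:
    rows = tf.gather(emb, [3, 7])   # only rows 3 and 7 participate
    loss = tf.reduce_sum(rows ** 2)

grad = tape.gradient(loss, emb)
print(type(grad))                   # tf.IndexedSlices, not a dense tf.Tensor

lr = 0.01
if isinstance(grad, tf.IndexedSlices):
    # Sparse path: touch only the rows that actually have a gradient.
    emb.scatter_sub(tf.IndexedSlices(lr * grad.values, grad.indices))
else:
    # Dense path: update every entry of the variable.
    emb.assign_sub(lr * grad)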

Related

Difference between Keras and tensorflow implementation of LSTM with dropout

I was reviewing the documentation for the LSTM cell in tensorflow and Keras. In particular, I want to apply dropout as well. Here is what I have in Keras; I would like to build the same LSTM cell in tensorflow:
cell = LSTM(num_units_2, return_sequences=True, dropout=dropout, recurrent_dropout=dropout)(net)
Therefore, I know that I need to use tf.nn.rnn_cell.LSTMCell in tensorflow with num_units = num_units_2. Second, I need a DropoutWrapper:
cell = tf.nn.rnn_cell.DropoutWrapper(cell)
Now, I want to apply dropout and recurrent_dropout similarly to the Keras code. I found that tensorflow's implementation of dropout applies a different dropout mask at every time step unless variational_recurrent is set to True (yet I'm not sure how variational_recurrent works in detail).
Additionally, I'm not sure if the LSTM in Keras applies a different mask at each time step as well.
Second, I was confused about the difference between output_keep_prob and state_keep_prob, as the descriptions of both read almost the same:
output_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added...
Any help is much appreciated!!
What variational dropout does
As far as I know, the main novelty of variational dropout is using the same dropout mask for all unrolled steps (as you said).
Difference between output_keep_prob and the state_keep_prob
output_keep_prob is the keep probability applied to the output (h) of the LSTM cell, whereas state_keep_prob is the keep probability applied to the recurrent state that is fed back into the cell. (Note that, by default, DropoutWrapper does not drop the memory c of an LSTMStateTuple; see the dropout_state_filter_visitor argument.)
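Putting this together, a sketch of the tensorflow side that mirrors the Keras call (num_units_2 and dropout come from the question; input_dim is a hypothetical placeholder for the depth of the cell input, which the wrapper needs for variational input dropout):
import tensorflow as tf

keep_prob = 1.0 - dropout  # Keras takes a drop rate, DropoutWrapper a keep probability

cell = tf.nn.rnn_cell.LSTMCell(num_units=num_units_2)
cell = tf.nn.rnn_cell.DropoutWrapper(
    cell,
    input_keep_prob=keep_prob,               # analogue of Keras's dropout (on the inputs)
    state_keep_prob=keep_prob,               # analogue of Keras's recurrent_dropout (on the state)
    variational_recurrent=True,              # reuse one mask across all time steps
    dtype=tf.float32,                        # required when variational_recurrent=True
    input_size=tf.TensorShape([input_dim]))  # required for input dropout under variational_recurrent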
Dropout choice in Keras
Looking at the _generate_dropout_mask method in the LSTM source code and its use in Keras's LSTMCell, I think the Keras LSTM uses variational recurrent dropout (one mask reused across time steps) only for the recurrent connections (i.e. self._recurrent_dropout_mask). But I'm not 100% confident about this.

Convolutional Neural Network Loss

While calculating the loss function, can I manually calculate the loss like
Loss = tf.reduce_mean(tf.square(np.array(Prediction) - np.array(Y)))
and then optimize this loss using the Adam optimizer?
No.
Tensorflow loss functions typically accept tensors as input and also output a tensor, so np.array() wouldn't work.
In the case of CNNs, you'd generally come across loss functions like cross-entropy, softmax cross-entropy, sigmoid cross-entropy, etc. These are already built into the tf.losses module, so you can use them directly.
The loss function that you're trying to apply looks like a mean-squared loss. This is built into tf.losses as well: tf.losses.mean_squared_error.
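For illustration, a minimal corrected sketch, assuming Prediction and Y are tensors produced by the graph rather than numpy arrays:
import tensorflow as tf

# Both operands are tensors, so the whole expression lives in the graph.
loss = tf.reduce_mean(tf.square(Prediction - Y))
# Or, equivalently, the built-in:
# loss = tf.losses.mean_squared_error(labels=Y, predictions=Prediction)

train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)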
Having said that, I've also implemented a few loss functions like cross-entropy using a hand-coded formula such as -tf.reduce_mean(tf.reduce_sum(targets * logProb)). This works equally well, as long as the inputs targets and logProb are computed as tensors and not as numpy arrays.
No, you actually need to compute the loss from tensors, not from numpy arrays like np.array(Prediction), since tensorflow evaluates these tensors inside its own engine.

Tensorflow assign the moving_variance and moving_average of tf.contrib.layers.batch_norm

I have a CNN model with some batch normalization layers in it. The batchnorm layer is constructed with tf.contrib.layers.batch_norm. The model works well in basic circumstances, but the problem is that I don't know how to assign its moving_variance and moving_mean.
In detail, as the official website describes, the batch norm layer has four parameters: variance, mean, scale, and offset. The last two are tensorflow variables which I can handle well. As for the first two (the moving statistics), even though I can get them with tf.get_collection(tf.GraphKeys.UPDATE_OPS), they are two tensors which I don't know how to assign. In most cases these two parameters are set during the training phase.
I have also tried tf.get_collection(tf.GraphKeys.VARIABLES); with it I can get two tensorflow variables named 'BatchNorm/moving_mean' and 'BatchNorm/moving_variance'. Although I can change these two variables' values with tf.assign, the weird thing is that the output of batchnorm doesn't change accordingly.
Thanks for any suggestions!
From Tensorflow official site:
https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm
Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
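If you do want to set the moving statistics by hand, here is a sketch (sess, new_mean, and new_var are hypothetical; the 'BatchNorm' scope name is taken from the question). Keep in mind that the moving statistics only affect the output when the layer is run with is_training=False; during training, batch norm normalizes with the current batch statistics, which is likely why assigning them appeared to have no effect:
import tensorflow as tf

# Look the variables up by name in the global collection.
moving_mean = [v for v in tf.global_variables()
               if 'BatchNorm/moving_mean' in v.name][0]
moving_var = [v for v in tf.global_variables()
              if 'BatchNorm/moving_variance' in v.name][0]

# new_mean / new_var are hypothetical arrays with the matching shapes.
sess.run([tf.assign(moving_mean, new_mean),
          tf.assign(moving_var, new_var)])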

Training sparse model with Keras

I'm trying to train a sparse model, that is some of the model parameters have to remain zero during optimization.
Is it possible in Keras to define a mask for the parameters so that the optimizer does not update the masked ones?
Unfortunately, freezing one layer would not work as I need to mask parameters in a more fine-grained fashion.
You can use tf.where to select elementwise between the parameters and tf.stop_gradient(parameters).
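A minimal sketch of that trick (the mask shown is a hypothetical example; True marks entries that are allowed to train):
import tensorflow as tf

mask = tf.constant([[True, False], [False, True]])  # True = trainable entry
w = tf.Variable(tf.zeros([2, 2]))

# The forward value equals w either way, but gradients only flow through the
# first branch of tf.where, so masked-out entries receive zero gradient and
# stay at their initial value (zero here).
w_effective = tf.where(mask, w, tf.stop_gradient(w))
# Build the model with w_effective in place of w.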

Implementing gradient descent in TensorFlow instead of using the one provided with it

I want to use gradient descent with momentum (keep track of previous gradients) while building a classifier in TensorFlow.
So instead of tensorflow.train.GradientDescentOptimizer, I want to use tensorflow.gradients to calculate the gradients, keep track of previous gradients, and update the weights based on all of them.
How do I do this in TensorFlow?
TensorFlow already has an implementation of gradient descent with momentum: tf.train.MomentumOptimizer.
To answer your general question about implementing your own optimization algorithm, TensorFlow gives you the primitives to calculate gradients and to update variables using the calculated gradients. In your model, suppose loss designates the loss function and var_list is a python list of the TensorFlow variables in your model (which you can get by calling tf.all_variables or tf.trainable_variables). Then you can calculate the gradients w.r.t. your variables as follows:
grads = tf.gradients(loss, var_list)
For simple gradient descent, you would simply subtract the product of the gradient and the learning rate from the variable. The code for that would look as follows:
var_updates = []
for grad, var in zip(grads, var_list):
    var_updates.append(var.assign_sub(learning_rate * grad))
train_op = tf.group(*var_updates)
You can train your model by calling sess.run(train_op). Now you can do all sorts of things before actually updating your variables. For instance, you can keep track of the gradients in a separate set of variables and use them for the momentum algorithm, or you can clip your gradients before updating the variables. These are all simple TensorFlow operations, because the gradient tensors are no different from any other tensors you compute in TensorFlow. Please look at the implementations (Momentum, RMSProp, Adam) of some of the fancier optimization algorithms to understand how you can implement your own.
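For instance, here is a minimal sketch of classic momentum built from the same primitives (learning_rate and momentum are hypothetical hyperparameters; loss and var_list are as above). This is illustrative, not the actual tf.train.MomentumOptimizer implementation:
import tensorflow as tf

learning_rate = 0.01
momentum = 0.9

grads = tf.gradients(loss, var_list)

# One non-trainable velocity variable per model variable holds the
# running history of gradients.
velocities = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
              for v in var_list]

var_updates = []
for grad, var, vel in zip(grads, var_list, velocities):
    new_vel = vel.assign(momentum * vel + grad)  # accumulate past gradients
    var_updates.append(var.assign_sub(learning_rate * new_vel))
train_op = tf.group(*var_updates)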