How to process gradients with a Dimension size of None - tensorflow

Using AdamOptimizer, when I get the gradients of a 2d variable, the second dimension's size ends up being None, while the first dimension is the same size as the variable's first dimension. This makes it difficult to process the gradients, since a size of None isn't compatible with other sizes for most functions. When I get the gradients of a 1d variable, the gradient's dimension size is the same as the variable's. I haven't tried variables with more than 2 dimensions.
Is this a bug? Is there a way to specify what the size of the gradient should be through the compute_gradients function? Is there a way to process the gradient that gets around the size None issue?

TL;DR: It shouldn't matter, and you can process the gradients using the tf.train.AdamOptimizer as normal. If you are seeing shape-related errors, they most likely arise from one of the known dimensions not matching.
The presence of None in a gradient tensor's shape simply means that the size in that dimension could not be statically inferred. This is not necessarily a bug: the shapes of many operators depend on their data inputs, and the TensorFlow Python front-end uses a simple heuristic (it only evaluates a limited set of ops whose inputs are constant) to decide which data inputs to evaluate. Almost all TensorFlow ops (excluding some image processing ops) will work on inputs whose shape is unknown or only partially known, and perform the corresponding checks at runtime instead.
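For example (a minimal illustration, not from the original answer), tf.matmul accepts inputs with unknown dimensions at graph construction time and defers the compatibility check to runtime:
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, None])  # shape only partially known
y = tf.matmul(x, x)   # accepted; the inner dimensions are checked at runtime
print(y.get_shape())  # (?, ?)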
The main way to process gradients is using Optimizer.apply_gradients(), which defers shape checking to the shape function for the ApplyAdam operator. This shape function asserts that the variable and gradient have the same shape, but the TensorShape.merge_with() method that it uses is lenient: a None dimension is compatible with any size, so the check can pass even when the runtime shapes turn out to differ.
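To see why the check passes, here is a small sketch of how TensorShape.merge_with() treats unknown dimensions:
import tensorflow as tf

known = tf.TensorShape([128, 10])
partial = tf.TensorShape([128, None])
# None is compatible with any size, so the merge succeeds:
print(known.merge_with(partial))  # (128, 10)
# Two different known sizes, by contrast, would raise a ValueError:
# tf.TensorShape([128, 10]).merge_with(tf.TensorShape([128, 7]))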
Finally, if you need to process the gradients at graph construction time, and your processing somehow depends on the gradients having known shapes, you can always use the Tensor.set_shape() method to copy the shape of the variable to the shape of the gradient, as these must be equivalent:
import tensorflow as tf

var = tf.Variable(tf.zeros([128, 10]))  # any variable
loss = tf.reduce_sum(tf.square(var))    # any loss that depends on `var`
grad = tf.gradients(loss, [var])[0]
# `grad` and `var` must have the same shape, so copy the variable's
# static shape onto the gradient tensor.
grad.set_shape(var.get_shape())

Related

BroadcastGradientArgs no documentation provided

I am creating my custom ops. While inspecting the ops in back prop, I am coming across BroadcastGradientArgs.
Does anyone have any idea what this does?
It is an internal op that returns the reduction axes given two tensor shapes. Notice that its return values are always used by reduce_sum. Ops that support broadcasting (i.e., an op involving a tensor of lesser rank or shape) need a reduction step so that the resulting gradient has the same shape as the original input; it has the effect of summing the individual gradients into one value.
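As a concrete illustration (a small sketch, not from the original answer): when a tensor is broadcast in the forward pass, its gradient must be summed over the broadcast axes, and BroadcastGradientArgs computes which axes those are:
import tensorflow as tf

x = tf.zeros([4, 3])
y = tf.zeros([3])  # broadcast across the first axis of x
z = x + y
grad_y = tf.gradients(tf.reduce_sum(z), y)[0]
# grad_y has shape [3]: the per-element gradients were summed
# over the broadcast axis identified by BroadcastGradientArgs.
print(grad_y.get_shape())  # (3,)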

Tensorflow: difference get_tensor_by_name vs get_operation_by_name?

The answer here says that one returns an operation while the other returns a tensor. That is pretty obvious from the name and from the documentation. However, suppose I do the following:
logits = tf.add(tf.matmul(inputs, weights), biases, name='logits')
I am following the pattern described in Tensorflow Mechanics 101. Should I restore it as an operation or as a tensor? I am afraid that if I restore it as a tensor I will only get the last computed values for the logits; nonetheless, the post here seems to suggest that there is no difference, or that I should just use get_tensor_by_name. The idea is to compute the logits for a new set of inputs and then make predictions accordingly.
Short answer: you can use both get_operation_by_name() and get_tensor_by_name(). Long answer:
tf.Operation
When you call
op = graph.get_operation_by_name('logits')
... it returns an instance of type tf.Operation, which is a node in the computational graph that performs some computation on its inputs and produces one or more outputs. In this case, it's an Add op.
One can always evaluate an op in a session, and if the op needs placeholder values to be fed in, the engine will force you to provide them. Some ops, e.g. reading a variable, don't have any dependencies and can be executed without placeholders.
In your case (I assume), logits is computed from the input placeholder x, so logits has no value without a particular x.
tf.Tensor
On the other hand, calling
tensor = graph.get_tensor_by_name('logits:0')
... returns an object tensor, which has the type tf.Tensor:
Represents one of the outputs of an Operation.
A Tensor is a symbolic handle to one of the outputs of an Operation. It does not hold the values of that operation's output, but instead provides a means of computing those values in a TensorFlow tf.Session.
So, in other words, evaluating a tensor is the same as executing its operation, and all of the restrictions described above apply as well.
Why is a Tensor useful? A Tensor can be passed as an input to another Operation, thus forming the graph. But in your case, you can assume that both entities mean the same thing.
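For example (a minimal sketch; the placeholder and variable shapes are assumptions), both retrieval paths lead to the same graph node, and feeding a new batch recomputes the logits rather than returning stale values:
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    inputs = tf.placeholder(tf.float32, [None, 2], name='inputs')
    weights = tf.Variable(tf.ones([2, 3]))
    biases = tf.Variable(tf.zeros([3]))
    logits = tf.add(tf.matmul(inputs, weights), biases, name='logits')

op = graph.get_operation_by_name('logits')      # a tf.Operation
tensor = graph.get_tensor_by_name('logits:0')   # its first output
assert tensor is op.outputs[0]

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    # Feeding a new batch recomputes the logits from scratch.
    print(sess.run(tensor, feed_dict={inputs: [[1.0, 2.0]]}))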

Optimizing a subset of a tensor in Tensor Flow

I have a free variable (tf.Variable) x, and I wish to minimize an error term with respect to a subset of the tensor x (for example, minimizing the error only with respect to the first row of a 2D tensor).
One way is to compute the gradients, zero out the gradient for the irrelevant parts of the tensor, and then apply the gradients. Is there another way?
You can use a mask and tf.stop_gradient to selectively freeze part of the variable: res = mask * x + tf.stop_gradient((1 - mask) * x). Entries of mask should be 1 where gradients should be applied and 0 otherwise.
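A minimal sketch of this pattern (the shapes and optimizer are assumptions for illustration):
import tensorflow as tf

x = tf.Variable(tf.random_normal([4, 3]))
# Train only the first row; freeze the rest.
mask = tf.constant([[1.0], [0.0], [0.0], [0.0]])  # broadcast over columns
x_partial = mask * x + tf.stop_gradient((1.0 - mask) * x)

loss = tf.reduce_sum(tf.square(x_partial - 1.0))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# Gradients flow only through `mask * x`; rows where mask == 0
# keep their initial values after any number of training steps.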

Tensorflow: Copy variable-size matrix from one GPU to another and pretend copy has zero derivative

I have computed matrices of size [None, 1024] on each of two GPUs (call them "left GPU" and "right GPU"). The None represents the batch size. I want to copy the matrix from the right GPU to the left GPU (where it is treated as constant for differentiation purposes) and then multiply them:
result = tf.matmul(left_matrix, right_matrix_copied, transpose_b=True)
to obtain a square matrix of shape [None, None]. It's important that the matrix be square since I proceed to apply tf.diag_part to the matrix. (And in case you're wondering, I also use all the off-diagonal entries.)
I tried doing this by assigning the right matrix to a tf.Variable with trainable=False and then using assign with validate_shape=False, but I am still forced to specify the variable's initial shape statically (with no dimensions allowed to be None). And when I change the shape dynamically, the tf.diag_part op complains.
How can I do this?
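One possible direction (an assumption on my part; no accepted answer is reproduced here): tf.stop_gradient treats its input as a constant for differentiation and places no static shape requirement on it, so an intermediate Variable is unnecessary:
import tensorflow as tf

with tf.device('/gpu:1'):
    right_matrix = tf.placeholder(tf.float32, [None, 1024])  # stand-in

with tf.device('/gpu:0'):
    left_matrix = tf.placeholder(tf.float32, [None, 1024])   # stand-in
    # Referencing `right_matrix` under this device scope inserts the
    # GPU-to-GPU copy; tf.stop_gradient makes the copy a constant
    # for differentiation purposes.
    right_copy = tf.stop_gradient(right_matrix)
    result = tf.matmul(left_matrix, right_copy, transpose_b=True)
    diag = tf.diag_part(result)  # valid at runtime when result is square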

Tensorflow RNN input size

I am trying to use tensorflow to create a recurrent neural network. My code is something like this:
import tensorflow as tf
rnn_cell = tf.nn.rnn_cell.GRUCell(3)  # hidden state size 3
# Two timesteps, each a batch of 1 with input size 2.
inputs = [tf.constant([[0, 1]], dtype=tf.float32), tf.constant([[2, 3]], dtype=tf.float32)]
outputs, end = tf.nn.rnn(rnn_cell, inputs, dtype=tf.float32)
Now, everything runs just fine. However, I am rather confused by what is actually going on. The output dimensions are always the batch size x the size of the rnn cell's hidden state - how can they be completely independent of the input size?
If my understanding is correct, the inputs are concatenated to the rnn's hidden state at each step, and then multiplied by a weight matrix (among other operations). This means that the dimensions of the weight matrix need to depend on the input size, which is impossible, because the rnn_cell is created before the inputs are even declared!
After seeing the answer to a question about tensorflow's GRU implementation, I've realized what's going on. Counter to my intuition, the GRUCell constructor doesn't create any weight or bias variables at all. Instead, it creates its own variable scope, and then instantiates the variables on demand when actually called. Tensorflow's variable scoping mechanism ensures that the variables are only created once, and shared across subsequent calls to the GRU.
I'm not sure why they decided to go with this rather confusing implementation, which, as far as I can tell, is undocumented. To me it seems more appropriate to use Python's object-level scoping to encapsulate the TensorFlow variables within the GRUCell itself, rather than relying on an additional implicit scoping mechanism.
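This deferred creation is easy to observe (a small sketch; exact variable names and the required scoping vary across TensorFlow versions):
import tensorflow as tf

cell = tf.nn.rnn_cell.GRUCell(3)
print(tf.trainable_variables())  # [] -- no weights created yet

inputs = tf.constant([[0.0, 1.0]])                       # input size 2
state = cell.zero_state(batch_size=1, dtype=tf.float32)  # hidden size 3
output, new_state = cell(inputs, state)  # variables created on this call

# The weight shapes now depend on input_size + state_size = 2 + 3.
for v in tf.trainable_variables():
    print(v.name, v.get_shape())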