Optimizing a subset of a tensor in TensorFlow - tensorflow

I have a free variable (tf.Variable) x, and I wish to minimize an error term with respect to a subset of the tensor x (for example, minimizing the error only with respect to the first row of a 2D tensor).
One way is to compute the gradients, zero out the gradient for the irrelevant parts of the tensor, and then apply the gradients. Is there another way?

You can combine a mask with tf.stop_gradient to make part of the variable effectively non-trainable: use mask * x + tf.stop_gradient((1 - mask) * x) in place of x. A 1 in mask marks entries that should receive gradients, and a 0 marks entries that should not.
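A minimal sketch of this masking trick, assuming TF 2.x eager mode (the original thread predates it, but the mechanics are the same): only the first row of `x` receives a gradient.

```python
import tensorflow as tf

# Train only the first row of a 3x3 variable; hold the rest constant.
x = tf.Variable(tf.ones([3, 3]))
mask = tf.constant([[1.0, 1.0, 1.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])

with tf.GradientTape() as tape:
    # Gradient flows only through the `mask * x` term; the
    # stop_gradient term keeps the forward value intact.
    x_eff = mask * x + tf.stop_gradient((1.0 - mask) * x)
    loss = tf.reduce_sum(tf.square(x_eff))

grad = tape.gradient(loss, x)
# grad is 2*x on the first row and 0 everywhere else.
```

Because `x_eff` equals `x` in the forward pass, the loss value is unchanged; only the backward pass sees the mask.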

Related

BroadcastGradientArgs no documentation provided

I am creating my custom ops. While inspecting the ops in backprop, I came across BroadcastGradientArgs.
Does anyone have any idea what this does?
It is an internal op that returns the reduction axes for a pair of tensor shapes. Notice that its return values are always fed to reduce_sum. Ops that support broadcasting (an op involving a tensor of lesser rank or shape) need a reduction step so that the resulting gradient has the same shape as the original input. It has the effect of summing the individual gradients along each broadcast axis into one value.
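A sketch of the reduction that BroadcastGradientArgs drives (assuming TF 2.x eager mode): `b` is broadcast from shape [4] to [3, 4] in the forward pass, so its gradient must be summed back over the broadcast axis.

```python
import tensorflow as tf

a = tf.Variable(tf.ones([3, 4]))
b = tf.Variable(tf.ones([4]))   # broadcasts against `a`

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(a + b)

grad_a, grad_b = tape.gradient(loss, [a, b])
# grad_b has shape [4], not [3, 4]: each entry is the sum of the
# three incoming per-row gradients, one per broadcast row.
```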

Does tf.trace() only evaluate diagonal elements?

I have a TensorFlow tensor t with shape (d, d), a square matrix. I define the trace tensor tr = tf.trace(t) and then evaluate it with session.run(tr). Is TensorFlow smart enough to only evaluate the diagonal elements of t, or are all elements of t evaluated first, with the trace computed only afterwards?
TensorFlow will compute the matrix first, then run the trace op to extract/sum the diagonal. Potentially this is something that XLA could optimize away if no other ops consume the full matrix (not sure if it does or not currently), but TensorFlow itself sees these ops as more or less black boxes.
If there are no consumers of the full matrix, you could instead do your computations directly on a vector representing the diagonal. You could also use sparse tensors to avoid unnecessary computation while keeping track of indices.
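A small sketch of the diagonal-only alternative (using the current TF 2.x names: tf.trace has moved to tf.linalg.trace, and the diagonal extractor is tf.linalg.diag_part):

```python
import tensorflow as tf

t = tf.reshape(tf.range(9.0), [3, 3])

tr = tf.linalg.trace(t)                          # materializes t, then sums its diagonal
tr_diag = tf.reduce_sum(tf.linalg.diag_part(t))  # extracts the diagonal explicitly

# Both compute 0 + 4 + 8 = 12. Neither avoids building t here, but if t
# were itself computed element-wise, producing only its diagonal entries
# upstream would skip the off-diagonal work.
```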

What is the "gate_gradients" attribute of the minimize() function in the TensorFlow optimizer class?

Here is the link to the TF optimizer class: https://www.tensorflow.org/versions/r0.12/api_docs/python/train/optimizers
GATE_NONE: Take the simple case of a matmul op on two vectors x and y, and let the output be L. The gradient of L with respect to x is y, and the gradient of L with respect to y is xT (x transposed). With GATE_NONE it could happen that the gradient with respect to x is applied to modify x before the gradient for y is even calculated. If the gradient with respect to y were then computed from the modified x, it would be wrong. Of course this won't happen in such a simple case, but you can imagine it happening in more complex or extreme graphs.
GATE_OP: For each Op, make sure all gradients are computed before they are used. This prevents race conditions for Ops that generate gradients for multiple inputs where the gradients depend on the inputs. (You could see how this prevents the problem of GATE_NONE, though at the price of some parallelism).
GATE_GRAPH: Make sure all gradients for all variables are computed before any one of them is used. This provides the least parallelism but can be useful if you want to process all gradients before applying any of them. (An example of a use case is clipping gradients by global norm before applying them.)
On the same page that you linked, if you scroll down a bit, it says:
gate_gradients argument that controls the degree of parallelism during the application of the gradients
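A sketch of passing the gate level to minimize(), using the TF1 graph-mode API via tf.compat.v1 (GATE_OP is the default; the variable names and values here are illustrative):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # gate_gradients is a graph-mode option

x = tf.compat.v1.get_variable("x", initializer=[1.0, 2.0])
y = tf.compat.v1.get_variable("y", initializer=[3.0, 4.0])
loss = tf.reduce_sum(x * y)

opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
# GATE_GRAPH: all gradients are computed before any variable is updated.
train_op = opt.minimize(loss, gate_gradients=opt.GATE_GRAPH)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(train_op)
    x_val, y_val = sess.run([x, y])
# With gating, dloss/dx = y_old = [3, 4] and dloss/dy = x_old = [1, 2],
# so x becomes [0.7, 1.6] and y becomes [2.9, 3.8].
```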

Tensorflow: Copy variable-size matrix from one GPU to another and pretend copy has zero derivative

I have computed matrices of size [None, 1024] on each of two GPUs (call them "left GPU" and "right GPU"). The None represents the batch size. I want to copy the matrix from the right GPU to the left GPU (where it is treated as constant for differentiation purposes) and then multiply them:
result = tf.matmul(left_matrix, right_matrix_copied, transpose_b=True)
to obtain a square matrix of shape [None, None]. It's important that the matrix be square since I proceed to apply tf.diag_part to the matrix. (And in case you're wondering, I also use all the off-diagonal entries.)
I tried doing this by assigning the right matrix to a tf.Variable with trainable=False and then using assign with validate_shape=False, but I am still forced to specify the variable's initial shape statically (with no dimensions allowed to be None). And when I change the shape dynamically, the tf.diag_part op complains.
How can I do this?
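One approach worth sketching (an assumption on my part, not an answer from the thread): tf.stop_gradient treats its input as a constant for differentiation and, unlike a Variable, imposes no static-shape requirement, so a dynamic (None) batch dimension is fine. Pinning the matmul to the left GPU with tf.device would trigger the cross-device copy automatically; the stand-in tensors below just illustrate the shapes and gradient behavior.

```python
import tensorflow as tf

batch = 5                                 # stands in for the dynamic batch size
left = tf.random.normal([batch, 1024])    # "left GPU" matrix
right = tf.random.normal([batch, 1024])   # "right GPU" matrix

right_const = tf.stop_gradient(right)     # zero derivative w.r.t. `right`
result = tf.matmul(left, right_const, transpose_b=True)  # shape [batch, batch]
diag = tf.linalg.diag_part(result)        # square, so diag_part is happy
```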

How to process gradients with a Dimension size of None

Using AdamOptimizer, when I get the gradients of a 2d variable, the second dimension's size ends up being None, while the first dimension is the same size as the variable's first dimension. This makes it difficult to process the gradients, since a size of None isn't compatible with other sizes for most functions. When I get the gradients of a 1d variable, the gradient's dimension size is the same as the variable's. I haven't tried variables with more than 2 dimensions.
Is this a bug? Is there a way to specify what the size of the gradient should be through the compute_gradients function? Is there a way to process the gradient that gets around the size None issue?
TL;DR: It shouldn't matter, and you can process the gradients using the tf.train.AdamOptimizer as normal. If you are seeing shape-related errors, this most likely arises from one of the known dimensions not matching.
The presence of None in a gradient tensor's shape simply means that the size in that dimension could not be statically inferred. This is not necessarily a bug: the shapes of many operators depend on their data inputs, and the TensorFlow Python front-end uses a simple heuristic (i.e., only compute a limited set of ops with constant inputs) to decide what data inputs to evaluate. Almost all of the TensorFlow ops—excluding some image processing ops—will work on inputs whose shape is unknown (or only partially known), and perform checks at runtime instead.
The main way to process gradients is using Optimizer.apply_gradients(), which defers shape checking to the shape function for the ApplyAdam operator. This shape function asserts that the variable and gradient have the same shape, but the TensorShape.merge_with() method is lenient when either shape contains None, so the check passes.
Finally, if you need to process the gradients at graph construction time, and your processing somehow depends on the gradients having known shapes, you can always use the Tensor.set_shape() method to copy the shape of the variable to the shape of the gradient, as these must be equivalent:
var = tf.Variable(...)
loss = ...
grad = tf.gradients(loss, [var])[0]
# `grad` and `var` must have the same shape.
grad.set_shape(var.get_shape())