BroadcastGradientArgs no documentation provided - tensorflow

I am creating my own custom ops. While inspecting the ops involved in backprop, I keep coming across BroadcastGradientArgs.
Does anyone have any idea what this op does?

It is an internal op that returns the reduction axes for two tensor shapes. Notice that its return values are always fed into reduce_sum. Ops that support broadcasting (ops that can involve a tensor of lesser rank or shape) need a reduction step so that the resulting gradient has the same shape as the original input. It has the effect of summing the broadcast gradient components back into a single gradient of the input's original size.
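To make this concrete, here is a minimal sketch of how a broadcasting op's gradient typically uses these axes, modeled on the gradient of an elementwise add. Note that gen_array_ops is an internal module, so treat this as illustrative rather than a stable API:

import tensorflow as tf
from tensorflow.python.ops import gen_array_ops

x = tf.ones([3, 4])     # full-shape operand
y = tf.ones([4])        # operand that gets broadcast along axis 0
grad = tf.ones([3, 4])  # incoming gradient for z = x + y

sx, sy = tf.shape(x), tf.shape(y)
# rx, ry are the axes along which each operand was broadcast.
rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)

# Sum the incoming gradient over the broadcast axes, then reshape so
# each result matches its operand's original shape.
gx = tf.reshape(tf.reduce_sum(grad, rx), sx)  # shape [3, 4]
gy = tf.reshape(tf.reduce_sum(grad, ry), sy)  # shape [4]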

Related

Does anybody know how to get the back-propagated errors to each layer in TensorFlow?

After the forward pass, one loss and one error are generated for the batch. Then, according to the chain rule, the error is back-propagated to the previous layers to update the parameters in each layer. Suppose I have the following network architecture:
I->(W1)->C1->(W2)->C2->(W3)->O
I is the input, O is the output, and W1, W2, W3 are the weights of the three layers. C1 and C2 are the outputs of the first two layers. With O and the ground truth, we obtain the loss and the error that will be back-propagated. My question is: in TensorFlow, is there any method to get the errors back-propagated to C1 and C2?
I know we could get the parameter operators as follows:
W1_op = tf.get_default_graph().get_tensor_by_name('W1:0')
W2_op = ...
My final purpose is to check whether the errors in my network are right, because I cannot otherwise verify that the gradient of one particular layer (a new user-defined op) is computed correctly. I want to check its gradient by inspecting and comparing the errors immediately before and after this layer.
I know that we could use tf.test.compute_gradient_error to do a gradient check, but it seems the result of the check for this new operator depends on the inputs. In some cases the gradient check passes (i.e., the theoretical and numerical gradients are closer than some threshold, say 1e-3), but in other cases it fails, depending on the parameters of that op. Please see the figure: the x-axis (log-scaled) is the parameter value, and the y-axis is the difference between the computed and numerically evaluated gradients. As the figure shows, for some parameter configurations the difference is very small, but for others the gradient check fails.
Thus, I'm not sure whether this is a good and valid operator that is suitable for learning.
In the Caffe framework, these errors are saved in each layer's diff memory. I want to get these back-propagated errors for each layer in TensorFlow. Does anybody know how to get them?
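A sketch of one way to get exactly these quantities, assuming the architecture above (all shapes and names here are hypothetical): tf.gradients can differentiate the loss with respect to any intermediate tensor, and dloss/dC1 and dloss/dC2 are precisely the back-propagated errors at those layers.

import tensorflow as tf

I = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.float32, [None, 1])
W1 = tf.Variable(tf.random_normal([10, 8]), name='W1')
W2 = tf.Variable(tf.random_normal([8, 6]), name='W2')
W3 = tf.Variable(tf.random_normal([6, 1]), name='W3')
C1 = tf.nn.relu(tf.matmul(I, W1))
C2 = tf.nn.relu(tf.matmul(C1, W2))
O = tf.matmul(C2, W3)
loss = tf.reduce_mean(tf.square(O - labels))

# The gradients of the loss w.r.t. C1 and C2 are the back-propagated
# errors arriving at those layers; evaluate them with session.run().
err_C1, err_C2 = tf.gradients(loss, [C1, C2])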

Optimizing a subset of a tensor in TensorFlow

I have a free variable (tf.Variable) x, and I wish to minimize an error term with respect to a subset of the tensor x (for example, minimizing the error only with respect to the first row of a 2D tensor).
One way is to compute the gradients, change the gradient to zero for the irrelevant parts of the tensor, and then apply the gradients. Is there another way?
You can use a mask together with tf.stop_gradient to selectively make parts of the variable non-trainable: use mask * x + tf.stop_gradient((1 - mask) * x) in place of x. A value of 1 in mask denotes entries where the gradient should be applied, and 0 otherwise.
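A minimal sketch of this pattern (the error term here is hypothetical). The forward value equals x, but gradients only flow through the unmasked entries:

import tensorflow as tf

x = tf.Variable(tf.random_normal([4, 3]))

# Train only the first row: 1s mark trainable entries, 0s frozen ones.
mask = tf.constant([[1.0, 1.0, 1.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])

# Value is identical to x; gradients reach x only through mask * x.
x_used = mask * x + tf.stop_gradient((1 - mask) * x)

loss = tf.reduce_sum(tf.square(x_used))  # hypothetical error term
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)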

Unaggregated gradients / gradients per example in tensorflow

Given a simple mini-batch gradient descent problem on MNIST in tensorflow (like in this tutorial), how can I retrieve the gradients for each example in the batch individually?
tf.gradients() seems to return gradients averaged over all examples in the batch. Is there a way to retrieve gradients before aggregation?
Edit: A first step towards answering this is figuring out at which point TensorFlow averages the gradients over the examples in the batch. I thought this happened in _AggregatedGrads, but that doesn't appear to be the case. Any ideas?
tf.gradients returns the gradient with respect to the loss. This means that if your loss is a sum of per-example losses, then the gradient is also the sum of per-example loss gradients.
The summing up is implicit. For instance, if you want to minimize the sum of squared norms of the Wx - y errors, the gradient with respect to W is 2(WX - Y)X', where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
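To spell out why the sum is implicit, write the loss over the columns x_i of X and y_i of Y:

L(W) = ||WX - Y||_F^2 = sum_i ||W x_i - y_i||^2

dL/dW = sum_i 2 (W x_i - y_i) x_i' = 2 (WX - Y) X'

The per-example terms 2 (W x_i - y_i) x_i' never appear as separate tensors; the single matrix product on the right forms their sum directly.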
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass; for this you would need to specify the gradients explicitly rather than relying on the tf.gradients method.
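A sketch of the batch-of-size-1 approach (model, shapes, and data here are hypothetical):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 3])
y = tf.placeholder(tf.float32, [1, 1])
w = tf.Variable(tf.ones([3, 1]))
loss = tf.reduce_sum(tf.square(tf.matmul(x, w) - y))
grad_op = tf.gradients(loss, [w])[0]

X_in = np.random.randn(4, 3).astype(np.float32)
Y_in = np.random.randn(4, 1).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # k passes, one example per pass; each run yields one example's gradient.
    per_example_grads = [
        sess.run(grad_op, feed_dict={x: X_in[i:i + 1], y: Y_in[i:i + 1]})
        for i in range(4)]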
To partly answer my own question after tinkering with this for a while: it appears that it is possible to manipulate gradients per example while still working in batches, by doing the following:
Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
Call your custom tf.gradients function and give your loss as a list of slices:
custagg_gradients(
    ys=[cross_entropy[i] for i in xrange(batch_size)],
    xs=variables.trainable_variables(),
    aggregation_method=CUSTOM,
    gradient_factors=gradient_factors
)
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).
One way of retrieving gradients before aggregation is to use the grad_ys parameter. A good discussion is found here:
Use of grads_ys parameter in tf.gradients - TensorFlow
EDIT:
I haven't been working with TensorFlow a lot lately, but here is an open issue tracking the best way to compute unaggregated gradients:
https://github.com/tensorflow/tensorflow/issues/675
There are a lot of sample-code solutions provided by users (including myself) that you can try based on your needs.
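As a rough sketch of the grad_ys approach (names hypothetical): weighting the vector of per-example losses with a one-hot grad_ys isolates the gradient contributed by a single example. Note that this still builds one gradient op per example, so it scales poorly for large batches:

import tensorflow as tf

batch_size = 4
x = tf.placeholder(tf.float32, [batch_size, 3])
w = tf.Variable(tf.ones([3, 1]))
# Per-example losses, shape [batch_size] (not yet reduced to a scalar).
per_example_loss = tf.squeeze(tf.square(tf.matmul(x, w)), axis=1)

# A one-hot grad_ys selects example i's contribution to the gradient.
per_example_grads = [
    tf.gradients(per_example_loss, w, grad_ys=tf.one_hot(i, batch_size))[0]
    for i in range(batch_size)]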

How to process gradients with a Dimension size of None

Using AdamOptimizer, when I get the gradients of a 2d variable, the second dimension's size ends up being None, while the first dimension is the same size as the variable's first dimension. This makes it difficult to process the gradients, since a size of None isn't compatible with other sizes for most functions. When I get the gradients of a 1d variable, the gradient's dimension size is the same as the variable's. I haven't tried variables with more than 2 dimensions.
Is this a bug? Is there a way to specify what the size of the gradient should be through the compute_gradients function? Is there a way to process the gradient that gets around the size None issue?
TL;DR: It shouldn't matter, and you can process the gradients using the tf.train.AdamOptimizer as normal. If you are seeing shape-related errors, they most likely arise from one of the known dimensions not matching.
The presence of None in a gradient tensor's shape simply means that the size in that dimension could not be statically inferred. This is not necessarily a bug: the shapes of many operators depend on their data inputs, and the TensorFlow Python front-end uses a simple heuristic (it only evaluates a limited set of ops with constant inputs) to decide which data inputs to evaluate. Almost all of the TensorFlow ops, excluding some image processing ops, will work on inputs whose shape is unknown (or only partially known), and perform checks at runtime instead.
The main way to process gradients is using Optimizer.apply_gradients(), which defers shape checking to the shape function for the ApplyAdam operator. This shape function asserts that the variable and gradient have the same shape, but the TensorShape.merge_with() method allows false positives in the presence of None in either of the shapes.
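For instance, TensorShape.merge_with() treats None as compatible with any known size, which is why a gradient with a statically unknown dimension still passes the shape check (a small sketch):

import tensorflow as tf

s_grad = tf.TensorShape([None, 10])  # shape with an unknown dimension
s_var = tf.TensorShape([5, 10])      # fully known variable shape
print(s_grad.merge_with(s_var))      # (5, 10): None unifies with 5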
Finally, if you need to process the gradients at graph construction time, and your processing somehow depends on the gradients having known shapes, you can always use the Tensor.set_shape() method to copy the shape of the variable to the shape of the gradient, as these must be equivalent:
var = tf.Variable(...)
loss = ...
grad = tf.gradients(loss, [var])[0]
# `grad` and `var` must have the same shape.
grad.set_shape(var.get_shape())

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code; it is defective, though, because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph then reads those Variables to compute the loss. Note that this will not work if you need to compute gradients through the part of the graph that computes the representation: computing those gradients would require recomputing every op in the encoder.
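A minimal sketch of this pattern (the encoder here is a stand-in, and all shapes are hypothetical):

import tensorflow as tf

num_inputs, input_dim, repr_dim = 1000, 64, 32
X = tf.placeholder(tf.float32, [num_inputs, input_dim])
combinations = tf.placeholder(tf.int32, [None, 2])

# Stand-in encoder; in practice this is the expensive network.
W_enc = tf.Variable(tf.random_normal([input_dim, repr_dim]))
representations = tf.tanh(tf.matmul(X, W_enc))

# Non-trainable Variable that caches the representations across run() calls.
cached = tf.Variable(tf.zeros([num_inputs, repr_dim]), trainable=False)
store = cached.assign(representations)  # run once whenever X changes

# The loss reads the cache; no encoder ops are recomputed here.
pair_reprs = tf.gather(cached, combinations)  # shape [batch, 2, repr_dim]
loss = tf.reduce_mean(tf.square(pair_reprs[:, 0] - pair_reprs[:, 1]))

You would run store once with X fed (sess.run(store, feed_dict={X: X_in})) and then evaluate loss repeatedly, feeding only combination batches. As noted above, gradients will not flow back into the encoder through the cached Variable.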
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now; it might be kind of spotty, but there's an optimizer_do_cse flag for graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, I couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all the calculations.
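A tiny sketch of the manual-CSE idea (all names hypothetical):

import tensorflow as tf

a = tf.placeholder(tf.float32, [128, 256])
b = tf.Variable(tf.random_normal([256, 256]))

# Build the expensive subexpression once...
shared = tf.matmul(a, b)

# ...and reference the same tensor everywhere it is needed, instead of
# constructing a second, identical tf.matmul(a, b) node.
out1 = tf.nn.relu(shared)
out2 = tf.reduce_sum(tf.square(shared))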