Can TensorFlow cache (sub-)graph computations?

Can TensorFlow automatically cache computations if they involve
multiple calls to the same computation (sub-)graph?
For example, I have a matrix F in which each entry represents a
computation based on trainable variables W. My objective function
multiplies this matrix several times with different vectors (each
time with unchanged W).
Will TensorFlow recompute, for example, F[1,2] whenever I access
it, or will it cache that value?
In theory, one could precompute the matrix F given a fixed W,
such that each entry in F is a tf.constant. But that would
prevent the correct computation of the gradients of W.

TensorFlow performs a limited amount of caching, but it probably doesn't cover the case that you describe.
If you create a tf.Session with the following options, constant folding will be enabled:
config = tf.ConfigProto(graph_options=tf.GraphOptions(
    optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L2)))
sess = tf.Session(config=config)
When you call sess.run() with this configuration, TensorFlow will determine which nodes it needs to execute, identify the subgraph of those nodes whose outputs are constant, evaluate that subgraph once, and cache the results. It will therefore avoid re-executing redundant computation.
However, in your question you mention that F is a function of some trainable variables. From TensorFlow's point of view, these variables are volatile—they may change at any time—so it does not cache values that are derived from these variables. If you want to reuse the same value for F multiple times, you could consider storing it in a tf.constant() so that the constant folding optimization is more useful.
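For illustration, here is a minimal sketch of that workaround in TF 1.x. The names v1 and v2 are hypothetical vectors, and the caveat in the comments applies:
import tensorflow as tf

# Hypothetical sketch: freeze F once W has stopped changing.
F_value = sess.run(F)            # evaluate F once with the current W
F_frozen = tf.constant(F_value)  # a constant node, eligible for folding/caching

# Reuse the frozen matrix in several products.
y1 = tf.matmul(F_frozen, v1)
y2 = tf.matmul(F_frozen, v2)
# Caveat: gradients no longer flow back to W through F_frozen,
# since the constant cuts the graph (as the question anticipates).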

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a NumPy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to variables.
The same problem applies to variables. There doesn't seem to be a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses a Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so for neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though nothing stops you from applying the LM update to a different minibatch in each iteration.
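As a toy illustration of that linearized solve (my own sketch, not tfg's implementation; the data and parameter names are made up):
import tensorflow as tf

# Fit y ≈ a * exp(b * x) by one Gauss-Newton/LM-style step.
x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.0, 2.7, 7.4, 20.1])
params = tf.Variable([0.5, 0.5])  # [a, b]

with tf.GradientTape() as tape:
    a, b = params[0], params[1]
    r = a * tf.exp(b * x) - y    # residual vector
J = tape.jacobian(r, params)     # Jacobian of residuals w.r.t. params

# Solve the linearized problem J @ delta ≈ -r; with fast=True,
# lstsq uses Cholesky on the Gram matrix formed from J.
delta = tf.linalg.lstsq(J, -r[:, tf.newaxis], fast=True)
params.assign_add(tf.squeeze(delta, axis=-1))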
I think you may only be able to get one iteration at a time out of tfg's LM implementation, through something like:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:
    def residual_fn(*params):
        # Ignore the passed-in params: since we run a single LM iteration per
        # call, the model variables still hold their current values here.
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(
        residual_fn, model.trainable_variables, max_iterations=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method will not work, where we assign the model parameters before computing the residuals:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(*params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(
    residual_fn, model.trainable_variables, max_iterations=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works on tf.Variables: tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited to optimizing Keras model parameters, since you can't directly compute model(input, parameters) - target_value without a tf.assign.

How does TensorFlow calculate gradients *efficiently* from input to loss?

To calculate the derivative of an output layer of size N w.r.t. an input of size M, we need a Jacobian matrix of size M x N. To calculate the complete gradient from the loss to the inputs using the chain rule, we would need to keep a large number of such Jacobians in memory.
I assume that tensorflow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse-mode differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. The backward pass must be repeated for each output dimension of f, so the computational complexity is O(dim(f)) times the cost of evaluating f, where dim(f) is the output dimensionality of f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
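A small TF 2.x sketch of this (shapes and names are illustrative): one backward pass over a tape yields all parameter gradients of a scalar loss, without ever materializing an intermediate Jacobian:
import tensorflow as tf

x = tf.random.normal([32, 100])               # batch of inputs
W1 = tf.Variable(tf.random.normal([100, 50]))
W2 = tf.Variable(tf.random.normal([50, 1]))

with tf.GradientTape() as tape:
    h = tf.nn.relu(x @ W1)                    # forward pass stores node values
    loss = tf.reduce_mean(tf.square(h @ W2))  # scalar f, so dim(f) = 1

# Reverse mode propagates vector-Jacobian products node by node:
# one backward pass, no explicit M x N Jacobians.
grads = tape.gradient(loss, [W1, W2])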
You might find this resource useful.

Is there a way to measure the backward pass of a model?

There is already a relevant question here: TensorFlow: Is there a way to measure FLOPS for a model?
However, the answer given by @Tobias Scheck covers only the forward-pass stats.
Is there a way to measure/estimate the backward pass as well?
If you just want to get a quick number, you can simply add
grads = tf.gradients(C, [A, B])
to @Tobias Scheck's code to construct the gradient computation nodes. Then, subtract the original number (without gradient ops) from the new one (with gradient ops) to get the estimated flops of the backward pass.
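Putting it together, a rough TF 1.x sketch (the shapes are arbitrary; profile once before and once after adding the gradient ops, then take the difference):
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    A = tf.Variable(tf.random_normal([25, 16]))
    B = tf.Variable(tf.random_normal([16, 9]))
    C = tf.matmul(A, B)

    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    forward_flops = tf.profiler.profile(g, options=opts).total_float_ops

    grads = tf.gradients(C, [A, B])  # add the gradient computation nodes
    total_flops = tf.profiler.profile(g, options=opts).total_float_ops

print('backward-pass flops (estimate):', total_flops - forward_flops)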
A word of caution about using this method in larger projects. This method uses static analysis of the whole graph. This has a few problems including:
The flops from ops in a while loop will be added only once.
Ops that are never normally run (some TF functionalities can leave garbage ops in the graph) will be added.
This analysis heavily depends on shape inference. It might not be available for all ops.
This analysis depends on registered functions that estimate the flops of a given op. There can be ops without such functions, and the functions that exist don't precisely model the flops done by the actual kernel your TF build picks to execute the op.
For more info see: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/core/profiler/g3doc/profile_model_architecture.md
It is better to use this in conjunction with an actual run record (RunMetadata), or to use a purely runtime-based approach, e.g. Can I measure the execution time of individual operations with TensorFlow?, and do some filtering/aggregation on the results.

What's the difference between tf.GraphKeys.TRAINABLE_VARIABLES and tf.GraphKeys.UPDATE_OPS in TensorFlow?

The documentation for tf.GraphKeys in TensorFlow describes collections such as TRAINABLE_VARIABLES: the subset of Variable objects that will be trained by an optimizer.
I also know of tf.get_collection(), which can retrieve the tensors you want.
When using tensorflow.contrib.layers.batch_norm(), the parameter updates_collections defaults to GraphKeys.UPDATE_OPS.
How should we understand these collections, and the difference between them?
Additionally, more can be found in ops.py.
These are two different things.
TRAINABLE_VARIABLES
TRAINABLE_VARIABLES is the collection of variables or training parameters which should be modified when minimizing the loss. For example, these can be the weights determining the function performed by each node in the network.
How do variables get added to this collection? This happens automatically when you define a new variable with tf.get_variable, unless you specify
tf.get_variable(..., trainable=False)
When would you want a variable to be untrainable? This happens from time to time. For example, occasionally you will want to use a two-step approach in which you first train the entire network on a large, generic dataset, then fine-tune the network on a smaller dataset which is specifically related to your problem. In such cases, you might want to fine-tune only part of the network, e.g., the last layer. Specifying some variables as untrainable is one of the ways to do this.
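A minimal TF 1.x sketch of that fine-tuning setup (the variable names are made up):
import tensorflow as tf

# Pretrained backbone weights stay fixed; only the new head is trained.
backbone_w = tf.get_variable('backbone_w', shape=[784, 256], trainable=False)
head_w = tf.get_variable('head_w', shape=[256, 10])  # trainable by default

# Only head_w shows up here, so optimizer.minimize(loss) leaves backbone_w alone.
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)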
UPDATE_OPS
UPDATE_OPS is a collection of ops (operations performed when the graph runs, like multiplication, ReLU, etc.), not variables. Specifically, this collection maintains a list of ops which need to run before each training step.
How do ops get added to this collection?
By definition, update_ops occur outside the regular flow of training by loss minimization, so generally you will be adding ops to this collection only under special circumstances. For example, when performing batch normalization, you want to recompute the batch mean and variance before each training step, and this is how it's done. The mechanics of batch normalization using tf.contrib.layers.batch_norm are described in more detail in this article.
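The usual pattern for wiring this up looks like the following (assuming a loss tensor already exists):
# Make sure the batch-norm moving-average updates run before each train step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)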
I disagree with the previous answer.
Actually, everything in TensorFlow is an op; the variables in the TRAINABLE_VARIABLES collection are also ops, created by tf.get_variable or tf.Variable.
As for the UPDATE_OPS collection, it usually contains the moving-mean and moving-variance updates created by the tf.layers.batch_norm function. These ops can also be regarded as variables, in the sense that their values are updated at each training step, just like the weights and biases.
The main difference is that the trainable variables participate in back propagation, while the variables behind UPDATE_OPS do not; the latter only participate in inference at test time, so no gradients are computed for the variables in UPDATE_OPS.

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, which is defective because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation: computing those gradients will require recomputing every op in the encoder.
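A rough sketch of that pattern; encoder, loss_from_combinations, and the shape names are all hypothetical stand-ins for your own graph:
import tensorflow as tf

# Compute the representations once, store them in a non-trainable Variable,
# and let the loss read from that Variable.
X = tf.placeholder(tf.float32, [num_inputs, input_dim])
representations = encoder(X)                        # the expensive sub-graph
cache = tf.Variable(tf.zeros([num_inputs, repr_dim]), trainable=False)
store_op = tf.assign(cache, representations)

loss = loss_from_combinations(cache, combinations)  # reads the cached values
train_op = optimizer.minimize(loss)

sess.run(store_op, feed_dict={X: X_in})             # run once per change of X_in
for combination_batch_in in batches(combinations_in, batch_size=128):
    sess.run(train_op, feed_dict={combinations: combination_batch_in})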
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now; it might be kind of spotty, but there's an optimizer_do_cse flag for graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate tensor, and reference that tensor in all the calculations.
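For example, a sketch of the manual approach (the function names here are made up):
# Build the shared sub-graph once and reference the resulting tensor everywhere;
# within a single sess.run(), each op is then computed only once.
shared = expensive_representation(X)
loss_a = head_a(shared)
loss_b = head_b(shared)   # reuses `shared`, no duplicate ops in the graph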