Does Tensorflow simplify a computational graph? - tensorflow

I have a simple question, and I have already searched quite a bit, but maybe I'm using the wrong keywords.
How does Tensorflow handle a given graph? If one has the simple graph:
x = tf.constant(1.0, name='input')
w = tf.constant(0.8, name='weight')
b = tf.constant(0.8, name='bias')
y_1 = tf.mul(w, x, name='output_1')
y_2 = tf.add(y_1, b, name='output_2')
The arithmetic statement is of course given by the computational graph, but does TensorFlow then compile and simplify it, e.g. to save time by not copying intermediate results, so that a 'condensed' version of the computational kernel is executed on the device (CPU or GPU)?
So that it reduces to something like that:
y_2 = tf.add(tf.mul(w, x), b, name='output_1')
Maybe somebody knows a good resource to learn more about how exactly TensorFlow runs under the hood, without digging too deep into the source code.
Thank you very much in advance!

TensorFlow includes various optimizations that can have the effect of simplifying a dataflow graph. In particular:
TensorFlow will apply common subexpression elimination to avoid performing redundant computation. In the case of your example, this will not have much effect, but TensorFlow will observe that w and b are the same constant, and replace them with a single value.
TensorFlow will apply constant propagation so that (computed) values that are the same in every execution of a subgraph will only be computed once. In your example, the entire expression is a constant, so TensorFlow will replace it with a single tf.constant() value corresponding to the result (1.6).
If you use the experimental XLA compiler, TensorFlow will make more aggressive simplifications, and may be able to replace a subgraph with a single TensorFlow kernel, containing just-in-time compiled code. If in your example x were a tf.placeholder(), the remainder of the computation could be compiled into a single kernel with one input and one output.
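To make the XLA point concrete, here is a minimal sketch (assuming a TF 1.x-era API, where tf.mul has been renamed tf.multiply, and using the session-config flag documented for enabling the experimental JIT). With x as a placeholder, the w * x + b subgraph becomes a candidate for fusion into a single compiled kernel:
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(tf.float32, name='input')   # non-constant input, so not fully foldable
w = tf.constant(0.8, name='weight')
b = tf.constant(0.8, name='bias')
y_2 = tf.add(tf.multiply(w, x), b, name='output_2')

with tf.Session(config=config) as sess:
    print(sess.run(y_2, feed_dict={x: 1.0}))   # 1.6; w * x + b may be fused into one kernel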

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, the algorithm repeatedly solves the linearized problem Jacobian * delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in case of neural nets it can only be applied for fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable_params here; it will still be at its initial value,
        # since we only do one iteration of Levenberg-Marquardt each time
        return model(input_batch) - target_batch

    new_objective_value, new_params = lm_minimize(
        residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive approach, where we assign the model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(
    residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients w.r.t. its input params, since that dependency goes through a tf.assign. It might even fail before that, because it uses constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works on tf.Variables: tfg.math.optimizer.levenberg_marquardt.minimize has a very different API, which is not really suited to optimizing Keras model parameters, because you can't directly compute model(input, parameters) - target_value without a tf.assign.
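As a rough starting point, here is a minimal sketch of such a hand-rolled LM step on tf.Variables, assuming eager execution and a small model (the function name lm_step and the damping value are made up for illustration; this is not tfg's implementation). It builds the residual Jacobian with GradientTape.jacobian and solves the damped linearized problem with tf.linalg.lstsq, as described above:
import tensorflow as tf

def lm_step(model, x, y, damping=1e-3):
    """One Levenberg-Marquardt update for a small model (sketch only)."""
    variables = model.trainable_variables
    with tf.GradientTape() as tape:
        residuals = tf.reshape(model(x) - y, [-1])        # r(theta), flattened
    # Jacobian of the residuals w.r.t. every trainable variable
    jacobians = tape.jacobian(residuals, variables)
    jac = tf.concat(
        [tf.reshape(j, [residuals.shape[0], -1]) for j in jacobians], axis=1)
    # Solve the damped linearized problem J * delta = -r
    # (lstsq forms the Gram matrix and uses Cholesky internally)
    delta = tf.linalg.lstsq(jac, tf.expand_dims(-residuals, axis=1),
                            l2_regularizer=damping)
    # Scatter the flat update back into the variables
    offset = 0
    for var in variables:
        size = var.shape.num_elements()
        var.assign_add(tf.reshape(delta[offset:offset + size, 0], var.shape))
        offset += size
Each call performs a single iteration; you would loop over it on your (full-batch or minibatch) data and tune the damping, keeping in mind the cubic cost in the number of parameters mentioned earlier.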

How is get_updates() of optimizers.SGD used in Keras during training?

I am not familiar with the inner workings of Keras and have difficulty understanding how Keras uses the get_updates() function of optimizers.SGD during training.
I searched quite a while on the internet, but only found a few details. Specifically, my understanding is that the parameter/weight update rule of SGD is defined in the get_updates() function. But it appears that get_updates() isn't literally called in every iteration during training; otherwise 'moments' wouldn't carry over from one iteration to the next to implement momentum correctly, since it is reset in every call, cf. optimizers.py:
shapes = [K.get_variable_shape(p) for p in params]
moments = [K.zeros(shape) for shape in shapes]
self.weights = [self.iterations] + moments
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))
As pointed out in https://github.com/keras-team/keras/issues/7502, get_updates() only defines 'a symbolic computation graph'. I'm not sure what that means. Can someone give a more detailed explanation of how it works?
For example, how is the 'v' computed in one iteration passed on to 'moments' in the next iteration to implement momentum? I'd also appreciate it if someone could point me to a tutorial on how this works.
Thanks a lot! (BTW, I'm using tensorflow, if it matters.)
get_updates() defines the graph operations that apply the gradient-based parameter updates.
When the graph is evaluated for training, it will look roughly like this:
forward passes compute a prediction value
loss computes a cost
backward passes compute gradients
parameters are updated using the gradients
Applying the updates is itself a graph computation; i.e., the snippet of code that you quote defines how to perform the operation by specifying which tensors are involved and what math operations occur. The math operations themselves do not happen at that point.
moments is a list of tensors defined in the code above. The code creates a graph operation that updates each element of moments.
Every iteration of the graph will run this update operation.
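To make the "define once, run every iteration" idea concrete, here is a small sketch in TF 1.x style (the variable names and the hard-coded momentum/learning rate are made up): the moment variable keeps its value between session.run() calls, which is how the momentum state carries over even though the update op was defined only once.
import tensorflow as tf

m = tf.Variable(0.0, name='moment')       # persists across iterations
g = tf.constant(1.0, name='gradient')     # stand-in for a computed gradient
momentum, lr = 0.9, 0.1

v = momentum * m - lr * g                 # velocity, defined once as graph ops
update_m = tf.assign(m, v)                # graph op that writes v back into m

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        print(sess.run(update_m))         # roughly -0.1, -0.19, -0.271: m carries over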
The following link tries to explain the concept of the computational graph in TensorFlow:
https://www.tensorflow.org/guide/graphs
Keras uses the same underlying ideas but abstracts the user away from the low-level details. Defining a model in the traditional TensorFlow 1.0 API requires a much higher level of detail.

Is there a way to measure the back-ward pass of a model?

There is a relevant question here already TensorFlow: Is there a way to measure FLOPS for a model?
However, the answer given by @Tobias Scheck only covers the forward-pass stats.
Is there a way to measure/estimate the backward pass as well?
If you just want to get a quick number, you can simply add
grads = tf.gradients(C, [A, B])
to @Tobias Scheck's code to construct the gradient computation nodes. Then, subtract the new number (with gradient ops) from the original one (without gradient ops) to get the estimated FLOPs for the backward pass.
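A minimal sketch of what that looks like, assuming the referenced answer's tf.profiler-based approach (A, B, and C here are made-up stand-in tensors, not anything from your model):
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    A = tf.Variable(tf.random_normal([25, 16]))
    B = tf.Variable(tf.random_normal([16, 9]))
    C = tf.matmul(A, B)
    grads = tf.gradients(C, [A, B])       # adds the backward-pass ops to the graph

opts = tf.profiler.ProfileOptionBuilder.float_operation()
flops = tf.profiler.profile(g, options=opts)
print('FLOPs with gradient ops:', flops.total_float_ops)
# Profiling the same graph without the tf.gradients() line and subtracting
# gives the estimated backward-pass FLOPs.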
A word of caution about using this method in larger projects. This method uses static analysis of the whole graph. This has a few problems including:
The flops from ops in a while loop will be added only once.
Ops that are never normally run (some TF functionalities can leave garbage ops in the graph) will be added.
This analysis heavily depends on shape inference. It might not be available for all ops.
This analysis depends on registering functions that can estimate the FLOPs of a given op. There can be ops without such functions, and such functions don't precisely model the FLOPs done by the actual kernel that your TF build picks to execute the op.
For more info see: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/core/profiler/g3doc/profile_model_architecture.md
It is better to use this in conjunction with an actual run record (RunMetadata) or use a purely runtime based approach, e.g. Can I measure the execution time of individual operations with TensorFlow?, and do some filtering/aggregation on the results.

Tensorflow: intercept gradients of arbitrary node in the computational graph (not necessarily loss)

I would like to intercept gradients that are backpropagated in my Tensorflow graph, which are not based on the loss (∂L/∂w), but based on some other node in the graph, for example the class scores (∂s/∂w) in a classification problem or some activation (∂a/∂w) to see how it changes when certain weights w change.
How can one implement this efficiently in Tensorflow? Intuitively, the gradients should already all be there for backprop of the loss as intermediate results, so there should be a solution without a big overhead.
I am already aware of the following suggestions, which don't exactly solve the problem:
The TensorFlow method tf.gradients(ys, xs), which computes the gradient of every y in ys w.r.t. every x in xs, but then, for every x in xs, sums over all y. Applying this function to every y in ys separately, however, induces a large computational overhead.
This stackoverflow post, which ask this question for the derivative of the loss w.r.t. some parameters, i.e. ∂L/∂w.
The part of the documentation which proposes calling optimizer.compute_gradients() as an easy-to-use 'wrapper' around tf.gradients(). However, calling this function for every variable of interest again introduces a large computational overhead.
Update: Phrased differently, what I want is the Jacobian of any component of the computational graph w.r.t. any other. This topic is touched on in this recent TensorFlow issue, but it is described there as not currently being implemented efficiently/conveniently.
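For what it's worth, here is a tiny sketch of the summing behaviour of tf.gradients() mentioned in the first point above (TF 1.x style, with made-up shapes): the result is the gradient of the sum of the ys, not a per-y Jacobian.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
s = tf.stack([x[0] * x[1], x[1] * x[2]])        # two 'score' nodes
grad = tf.gradients(s, x)[0]                     # = d(s[0] + s[1]) / dx

with tf.Session() as sess:
    print(sess.run(grad, feed_dict={x: [1., 2., 3.]}))  # [2. 4. 2.]; per-y info is summed away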

Can TensorFlow cache (sub-)graph computations?

Can TensorFlow automatically cache computations if they involve multiple calls to the same computation (sub-)graph?
For example, I have a matrix F in which each entry represents a computation based on trainable variables W. My objective function multiplies this matrix several times with different vectors (each time with unchanged W).
Will TensorFlow recompute, for example, F[1,2] whenever I access it, or will it cache that value?
In theory, one could precompute the matrix F given a fixed W, such that each entry in F is a tf.constant. But that would prevent the correct computation of the gradients of W.
TensorFlow performs a limited amount of caching, but it probably doesn't cover the case that you describe.
If you create a tf.Session with the following options, constant folding will be enabled:
config = tf.ConfigProto(graph_options=tf.GraphOptions(
    optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L2)))
sess = tf.Session(config=config)
When you call sess.run() with this configuration, TensorFlow will determine the appropriate nodes to run, identify the subgraph of those nodes whose outputs are constant, evaluate that subgraph, and cache the results. It will therefore avoid re-executing redundant computation.
However, in your question you mention that F is a function of some trainable variables. From TensorFlow's point of view, these variables are volatile—they may change at any time—so it does not cache values that are derived from these variables. If you want to reuse the same value for F multiple times, you could consider storing it in a tf.constant() so that the constant folding optimization is more useful.
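A small sketch of that suggestion (the names W, F, and v1 are illustrative only, and this mirrors the trade-off already noted in the question: once F is frozen into a constant, gradients no longer flow back to W):
import tensorflow as tf

W = tf.Variable(tf.random_normal([4, 4]), name='W')
F = tf.matmul(W, W, transpose_b=True, name='F')     # stand-in for the expensive matrix

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    f_value = sess.run(F)                            # evaluate F once for the current W

F_frozen = tf.constant(f_value, name='F_frozen')     # foldable/cacheable from now on
v1 = tf.random_normal([4, 1])
y1 = tf.matmul(F_frozen, v1)                          # reuses the frozen F; no gradient to W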