What caching model does TensorFlow use?

I read the question TensorFlow - get current value of a Variable, and the answer has left me confused.
On one hand, dga says "And to be very clear: Running the variable will
produce only the current value of the variable; it will not run any
assign operations associated with it. It's cheap."
On the other hand, Salvador Dali says "@dga yes, if the variable depends
on n other variables, they also need to be evaluated."
So, which is it? Does evaluating the variable only return its current
value, or does it recompute its value from scratch from the variables it
depends on?
What happens if I evaluate the same variable twice in a row? Does
TensorFlow have any notion of "stale" variables, i.e., variables that
need to be recomputed because their dependencies have changed (like in a
build system)?
I ask because I work with multiple nets where the partial output of one
net becomes the partial input of another net. I want to fetch the
gradients computed at the input layer of one net and merge+apply them to
the output layer of another net. I was hoping to do this by manually
retrieving/storing gradients in the variables of a graph, and then
running graph operations to backpropagate the gradients. Thus I need to
understand how it all works under the hood.
What I am doing is similar to this question:
How to use Tensorflow Optimizer without recomputing activations in reinforcement learning program that returns control after each iteration?, but I can't conclude from the last answer whether it's possible (has the experimental support mentioned there landed?)
Thanks!

@dga is correct. If you pass a tf.Variable object to tf.Session.run(), TensorFlow will return the current value of the variable, and it will not perform any computation. It is cheap: the cost of a memory copy, or possibly a network transfer in the case of a distributed TensorFlow setup. TensorFlow does not retain any history* of how the value of a tf.Variable was updated, so it cannot in general recompute its value from scratch.
(* Technically TensorFlow remembers the tf.Tensor that was used to initialize each variable, so it is possible to recompute the initial value of the variable.)
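
For concreteness, here is a minimal sketch of that behavior using the TF1-style graph API (tf.compat.v1 on newer installs); the variable and the assign op are illustrative:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

v = tf.Variable(0)
increment = tf.assign_add(v, 1)  # defined, but never run automatically

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(v))   # 0: reading is just a memory copy
    print(sess.run(v))   # still 0: nothing is recomputed or "stale"
    sess.run(increment)  # assign ops run only when explicitly fetched
    print(sess.run(v))   # 1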

Related

How to initialize the model with certain weights?

I am using the "stateful_clients" example from the tensorflow-federated examples. I want to use my pretrained model weights to initialize the model, via the function model.load_weights(init_weight), but it doesn't seem to work: the validation accuracy in the first round is still low. How can I solve this?
def tff_model_fn():
  """Constructs a fully initialized model for use in federated averaging."""
  keras_model = get_five_layers_cnn([28, 28, 1])
  keras_model.load_weights(init_weight)
  loss = tf.keras.losses.SparseCategoricalCrossentropy()
  return stateful_fedavg_tf.KerasModelWrapper(keras_model,
                                              test_data.element_spec, loss)
A quick primer on state and model weights in TFF
TFF takes a distinct perspective on state in machine learning, generally a consequence of its desire to be purely functional.
Usually in machine learning, a model is conceptually a function which takes data and produces a prediction. However, this notion is a little overloaded at times: does 'model' refer to a trained model (fitting the specification above), or to an architecture which is parameterized by its parameters, and therefore needs to accept those parameters as an argument to be considered truly a 'function'? A conception somewhat in the middle is that of a 'stateful function', which I think tends to be what people mean when they use the term 'model'.
TFF standardizes on the parameterized understanding: for TFF, a 'model' is a function which accepts parameters along with data as arguments, producing a prediction. This is precisely to avoid the notion of a stateful function, which is disallowed by a purely functional perspective (f(x) == f(x) should always be true, so f cannot have any state which affects its output).
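
As a toy illustration of that functional view (plain Python/numpy, not TFF API): the parameters arrive as an argument, and the function holds no state of its own.

import numpy as np

def model(params, x):
    # A pure function of (parameters, data): same inputs, same output.
    w, b = params
    return x @ w + b

params = (np.ones((3, 2)), np.zeros(2))
x = np.random.randn(4, 3)
assert (model(params, x) == model(params, x)).all()  # f(x) == f(x) holds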
On the code in question
I'm not super familiar with this portion of the TFF codebase; in particular, I'm a little surprised at the behavior of the Keras model wrapper, as usually TFF wants to serialize all logic into TFF-defined data structures as soon as possible (at least, this is how I think about it). Glancing at the code, it looks to me like it could work, but there have been exciting interactions between TFF and Keras in the past.
Briefly, here is how this path should be working:
1. The model function you define above is invoked while building the initialize computation, in a graph context; the logic to load weights (or the assignment of the weights themselves, baked into the graph as a constant) would hopefully be serialized into the graph that TFF generates to represent initialize.
2. Upon calling iterative_process.initialize, you would find your desired weights populated in the appropriate attributes of the returned data structure. This would serve as the starting point for your iterative process, and you would be off to the races.
What I am suspicious of in the above: TFF will silently invoke your model_fn in a TensorFlow graph context, resulting in non-program-order semantics. If there is no control dependency between the assignment and the return value of your function (there isn't in the code above, and in fact it is not obvious how to force one), the assignment may be skipped at initialize time, and the state returned from initialize won't have your specified weights.
If this suspicion is true, the appropriate solution is to run the weight-loading logic directly in Python. TFF provides some utilities to help with this kind of thing, like tff.learning.state_with_new_model_weights. This would be used roughly like:
state = iterative_process.initialize()
# Load the pretrained weights into a Keras model built in Python, then
# hand them to TFF as lists of numpy arrays. (Keyword names follow
# tff.learning.state_with_new_model_weights; exact details may vary by
# TFF version.)
keras_model = get_five_layers_cnn([28, 28, 1])
keras_model.load_weights(init_weight)
state = tff.learning.state_with_new_model_weights(
    state,
    trainable_weights=[w.numpy() for w in keras_model.trainable_weights],
    non_trainable_weights=[w.numpy() for w in keras_model.non_trainable_weights])
# continue on using state in the iterative process

How to get the value of loss in the update rule of chainer

I am trying to modify a class, SGDRule(optimizer.UpdateRule) of chainer, to make my own optimizer.
To achieve what I want, I need not only the gradient but also the loss.
Before the gradient is generated by backpropagation, a forward pass, which yields the loss, must be run. I need that loss.
The problem is that I have to access the loss from the code of update_core_gpu(self, param) in the class.
I learned that the Classifier object holds the loss as an attribute, but I don't know how to access that object from the update rule.
As an alternative, I considered using the Reporter object, which I can access from the code. I know how to pass a value to the reporter, but I have no idea how to get the loss that the reporter holds.
Does anybody know how to get the current loss in the code of update rule?
If you are using a model that holds the loss, e.g. a Classifier, one simple but perhaps less elegant way would be to pass the model to the Optimizer, and from there to each UpdateRule as it is constructed in Optimizer.create_update_rule. If you don't want to pass the model itself, you could pass a lambda that returns the loss from the model.
Another, probably cleaner approach, if it is sufficient for your case, is to implement an optimizer hook, similar to how gradient clipping is implemented in Chainer (see https://github.com/chainer/chainer/blob/master/chainer/optimizer_hooks/gradient_clipping.py#L56). You can obtain the loss via opt.target.loss (opt.target being your model) and, for instance, adjust the gradients prior to the optimization step, as in the sketch below.
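Here is a hedged sketch of such a hook, modeled on GradientClipping. It assumes the optimizer's target is a chainer.links.Classifier (which stores the most recent loss in its .loss attribute); the loss-dependent gradient scaling is purely illustrative.

import chainer

class LossScaledGradient(object):
    # Optimizer hook: reads the last loss from the model and rescales
    # the gradients before the parameter update.
    name = 'LossScaledGradient'
    call_for_each_param = False
    timing = 'pre'  # run after backprop, before the update step

    def __call__(self, opt):
        loss = float(opt.target.loss.array)  # opt.target is the model
        scale = 1.0 / (1.0 + loss)           # illustrative use of the loss
        for param in opt.target.params():
            if param.grad is not None:
                param.grad *= scale

# Usage sketch:
#   model = chainer.links.Classifier(my_net)
#   opt = chainer.optimizers.SGD().setup(model)
#   opt.add_hook(LossScaledGradient())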

Is there a way to measure the back-ward pass of a model?

There is a relevant question here already: TensorFlow: Is there a way to measure FLOPS for a model?
However, the answer given by @Tobias Scheck covers only the forward pass.
Is there a way to measure/estimate the backward pass as well?
If you just want a quick number, you can simply add
grads = tf.gradients(C, [A, B])
to @Tobias Scheck's code to construct the gradient-computation nodes. Then subtract the original number (without gradient ops) from the new one (with gradient ops) to get the estimated FLOPs of the backward pass.
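
Putting that together, a minimal sketch (the A/B/C matmul setup mirrors the linked answer; shapes are arbitrary):

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

g = tf.Graph()
with g.as_default():
    A = tf.Variable(tf.random.normal([25, 16]))
    B = tf.Variable(tf.random.normal([16, 9]))
    C = tf.matmul(A, B)
    grads = tf.gradients(C, [A, B])  # adds the backward-pass ops

    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.profiler.profile(g, options=opts)
    # Profile once more without the tf.gradients line and subtract the
    # two totals to isolate the backward-pass FLOPs.
    print('FLOPs (forward + backward):', flops.total_float_ops)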
A word of caution about using this method in larger projects: it relies on static analysis of the whole graph, which has a few problems, including:
- The FLOPs from ops inside a while loop will be counted only once.
- Ops that never actually run (some TF functionality can leave garbage ops in the graph) will be counted.
- The analysis relies heavily on shape inference, which may not be available for all ops.
- The analysis depends on functions registered to estimate the FLOPs of a given op; some ops have no such function, and the registered functions may not precisely model the FLOPs performed by the kernel TF actually picks to execute the op.
For more info see: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/core/profiler/g3doc/profile_model_architecture.md
It is better to use this in conjunction with an actual run record (RunMetadata), or to use a purely runtime-based approach, e.g. Can I measure the execution time of individual operations with TensorFlow?, and do some filtering/aggregation on the results.
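
A sketch of the run-record variant: capture RunMetadata from a real step and hand it to the profiler so it has actual runtime information to work with.

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

A = tf.Variable(tf.random.normal([25, 16]))
B = tf.Variable(tf.random.normal([16, 9]))
C = tf.matmul(A, B)
grads = tf.gradients(C, [A, B])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(grads, options=run_options, run_metadata=run_metadata)

    # run_meta supplies the profiler with data from the actual execution.
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.profiler.profile(tf.get_default_graph(),
                                run_meta=run_metadata, options=opts)
    print('FLOPs:', flops.total_float_ops)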

What does tf.train.get_global_step() do in TensorFlow?

What is the use of the function tf.train.get_global_step() in TensorFlow?
In machine learning concepts what is it equivalent to?
You could use it to resume training exactly where you left off after the training procedure has been stopped for some reason. Of course you can always restart training without knowing the global_step (if you save checkpoints regularly in your code, that is), but unless you somehow keep track of how many iterations you have already performed, you will not know how many are left after the restart. Sometimes you really want your model to be trained for exactly n iterations, not n plus an unknown number performed before the crash. So in my opinion, this is more of a practicality than a theoretical machine learning concept.
tf.train.get_global_step() returns the global step (a variable, a tensor from the variable node, or None if it does not exist) via get_collection(tf.GraphKeys.GLOBAL_STEP) or get_tensor_by_name('global_step:0').
The global step is widely used in learning rate decay (like tf.train.exponential_decay; see Decaying the learning rate for more information).
You can pass the global step to an optimizer's apply_gradients or minimize method to have it incremented by one with each update step.
Once you have defined the global step variable, you can get its value with sess.run(global_step).
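
A minimal end-to-end sketch of those pieces (TF1-style API; the tiny loss is just a placeholder problem):

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
loss = tf.reduce_sum(tf.square(tf.matmul(x, w)))

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    0.1, global_step, decay_steps=1000, decay_rate=0.96)
# Passing global_step makes minimize() increment it once per update.
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)
    print(sess.run(tf.train.get_global_step()))  # -> 3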

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
- supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
- define a loss based on various combinations of these smaller representations
- train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code; as written it is defective, because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph then reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation: computing those gradients would require recomputing every op in the encoder.
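A sketch of that pattern (all names and shapes are illustrative; note again that gradients will not flow back into the encoder through the Variable):

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

n_inputs, dim_in, dim_rep = 1000, 64, 16

X = tf.placeholder(tf.float32, [n_inputs, dim_in])
encoded = tf.layers.dense(X, dim_rep)               # the expensive encoder
reps = tf.get_variable('reps', [n_inputs, dim_rep], trainable=False)
store_reps = tf.assign(reps, encoded)               # phase 1: cache once

combinations = tf.placeholder(tf.int32, [None, 2])  # phase 2: reuse
pairs = tf.gather(reps, combinations)               # [batch, 2, dim_rep]
loss = tf.reduce_sum(tf.square(pairs[:, 0] - pairs[:, 1]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(store_reps, feed_dict={X: np.random.randn(n_inputs, dim_in)})
    for batch in np.split(np.random.randint(0, n_inputs, [256, 2]), 2):
        print(sess.run(loss, feed_dict={combinations: batch}))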
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now; it might be somewhat spotty, but there is an optimizer_do_cse flag in the Graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE": figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all downstream calculations, for example:
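
A toy sketch of the manual approach (the layer and loss choices are arbitrary): build the expensive subgraph once and have every consumer reference that one tensor, so fetching both losses in a single session.run evaluates the encoder only once.

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

X = tf.placeholder(tf.float32, [None, 64])
shared = tf.layers.dense(X, 16)            # factored-out common subexpression
loss_a = tf.reduce_sum(tf.square(shared))  # both losses reference the same
loss_b = tf.reduce_mean(tf.abs(shared))    # `shared` tensor, built only once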