Optimization in GPflow 2: Why set autograph=False?

In the current notebook tutorials (GPflow 2.0), all @tf.function decorators include the option
autograph=False, e.g. (https://gpflow.readthedocs.io/en/2.0.0-rc1/notebooks/advanced/gps_for_big_data.html):
@tf.function(autograph=False)
def optimization_step(optimizer, model: gpflow.models.SVGP, batch):
    with tf.GradientTape(watch_accessed_variables=False) as tape:
        tape.watch(model.trainable_variables)
        objective = -model.elbo(*batch)
    grads = tape.gradient(objective, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return objective
Does anyone know why that is the case, or what the reasoning behind this is?
As far as I understood, autograph=True simply allows Python control flow to be translated into a graph structure. Does setting (or leaving) it to True, even if that functionality is not required, have any drawbacks?
My guess would have been that it's just a small overhead at graph compile time, which should be negligible. Is that wrong?
Thanks

The reason we set autograph to False in most of the tf.function-wrapped objectives is that GPflow makes use of a multiple-dispatch Dispatcher which internally uses generators. TensorFlow, however, cannot deal with generator objects in autograph mode (see Capabilities and Limitations of AutoGraph), which leads to warnings like these:
WARNING:tensorflow:Entity <bound method Dispatcher.dispatch_iter of <dispatched sample_conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING: Entity <bound method Dispatcher.dispatch_iter of <dispatched sample_conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING:tensorflow:Entity <bound method Dispatcher.dispatch_iter of <dispatched conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING: Entity <bound method Dispatcher.dispatch_iter of <dispatched conditional>> appears to be a generator function. It will not be converted by AutoGraph.
We've known about this issue for a while but haven't got around to actually fixing it - thanks for bringing this back to our attention. I've just created a PR which fixes this issue so that you no longer need to set autograph to False. I expect this PR to be merged fairly soon.
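To make the limitation concrete, here is a toy sketch (not GPflow code; dispatch_impls is an illustrative stand-in for the generator-based dispatcher): AutoGraph does not convert generator functions, so passing autograph=False simply skips the conversion attempt and thereby avoids the warnings above.

import tensorflow as tf

def dispatch_impls():
    # a generator, similar in spirit to Dispatcher.dispatch_iter
    yield tf.square

@tf.function(autograph=False)  # skip AutoGraph so the generator is never converted
def objective(x):
    impl = next(dispatch_impls())  # resolved in plain Python at trace time
    return impl(x)

print(objective(tf.constant(3.0)))  # tf.Tensor(9.0, shape=(), dtype=float32)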

Related

Running models eagerly in Keras

I have noticed that even after compiling the model with .compile(..., run_eagerly=False), the print statements in .call() keep working. Does that mean that .call() is to be manually wrapped in a tf.function?
The run_eagerly argument to compile() sets the attribute tf.keras.Model.run_eagerly, but that attribute is not used inside call(). call() is NotImplemented in tf.keras.Model; it is there to be overridden for custom models and is not what the compile()/fit() workflow wraps directly. You can find the tf.function wrapping logic in
tf.keras.Model.make_train_function, which fit() uses automatically to wrap train_step() and related methods; these sit at a higher level than call().
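A minimal sketch of that split (assuming a TF 2.x Keras setup; the Demo model is illustrative): calling the model directly runs call() eagerly, while fit() wraps train_step() in a tf.function via make_train_function(), so Python side effects in call() only fire while the training function is being traced.

import tensorflow as tf

class Demo(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1)

    def call(self, inputs):
        print("call() executed in Python")  # a Python side effect
        return self.dense(inputs)

model = Demo()
model.compile(optimizer="adam", loss="mse", run_eagerly=False)

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))

model(x)                   # direct call: runs eagerly, the print fires every time
model.fit(x, y, epochs=2)  # inside fit: the print only fires while train_step is traced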

How to initialize the model with certain weights?

I am using the example "stateful_clients" in tensorflow-federated examples. I want to use my pretrained model weights to initialize the model. I use the function model.load_weights(init_weight). But it seems that it doesn't work. The validation accuracy in the first round is still low. How can I solve the problem?
def tff_model_fn():
    """Constructs a fully initialized model for use in federated averaging."""
    keras_model = get_five_layers_cnn([28, 28, 1])
    keras_model.load_weights(init_weight)
    loss = tf.keras.losses.SparseCategoricalCrossentropy()
    return stateful_fedavg_tf.KerasModelWrapper(keras_model,
                                                test_data.element_spec, loss)
A quick primer on state and model weights in TFF
TFF takes a distinct perspective on state in machine learning, generally a consequence of its desire to be purely functional.
Usually in machine learning, a model is conceptually a function which takes data and produces a prediction. However, this notion is a little overloaded at times; does 'model' refer to a trained model (fitting the specification above), or an architecture which is parameterized by its parameters, and therefore needs to accept these parameters as an argument to be considered truly a 'function'? A conception somewhat in the middle is that of a 'stateful function', which I think tends to be what people intend to refer to when they use the term 'model'.
TFF standardizes on the latter understanding. For TFF, a 'model' is a function which accepts parameters along with data as an argument, producing a prediction. This is generally to avoid the notion of a stateful function, which is disallowed by a purely functional perspective (f(x) == f(x) should always be true, so f cannot have any state which affects its output).
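A toy illustration of the distinction (the names are illustrative, not TFF APIs): a stateful object owns its parameters, whereas the purely functional view TFF standardizes on takes parameters and data as explicit arguments.

import tensorflow as tf

class StatefulModel:
    def __init__(self):
        self.w = tf.Variable(2.0)  # state lives on the object

    def predict(self, x):
        return self.w * x

def functional_predict(params, x):
    # no hidden state: f(params, x) == f(params, x) always holds
    return params * x

print(StatefulModel().predict(tf.constant(3.0)))               # 6.0
print(functional_predict(tf.constant(2.0), tf.constant(3.0)))  # 6.0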
On the code in question
I'm not super familiar with this portion of the TFF codebase; in particular I'm a little surprised at the behavior of the keras model wrapper, as usually TFF wants to serialize all logic into TFF-defined data structures as soon as possible (at least, this is how I think about it). Glancing at the code, it looks to me like it could work, but there have been exciting interactions between TFF and Keras in the past.
Briefly, here is how this path should be working:
The model function you define above is invoked while building the initialize computation, in a graph context; the logic to load weights (or assignment of the weights themselves, baked into the graph as a constant) would hopefully be serialized into the graph that TFF generates to represent initialize.
Upon calling iterative_process.initialize, you would find your desired weights populated in the appropriate attributes of the returned data structure. This would serve as your initial starting point for your iterative process, and you would be off to the races.
What I am suspicious of in the above is that TFF will silently invoke your model_fn in a TensorFlow graph context, resulting in semantics that do not follow program order; if there is no control dependency between the assignment and the return value of your function (which there isn't in the code above, and in fact it is not obvious how to force this), the assignment may be skipped at initialize time. Therefore the state returned from initialize won't have your specified weights.
If this suspicion is true, the appropriate solution is to run the weight-loading logic directly in Python. TFF provides some utilities to help with this kind of thing, like tff.learning.state_with_new_model_weights. This would be used like:
state = iterative_process.initialize()
weights = tf.keras.load_weights(...) # No idea if this call is correct, probably not.
state_with_loaded_weights = tff.learning.state_with_new_model_weights(state, weights)
...
# continue on using state in the iterative process

Should I use @tf.function for all functions?

An official tutorial on @tf.function says:
To get peak performance and to make your model deployable anywhere,
use tf.function to make graphs out of your programs. Thanks to
AutoGraph, a surprising amount of Python code just works with
tf.function, but there are still pitfalls to be wary of.
The main takeaways and recommendations are:
Don't rely on Python side effects like object mutation or list appends.
tf.function works best with TensorFlow ops, rather than NumPy ops or Python primitives.
When in doubt, use the for x in y idiom.
It only mentions how to implement @tf.function-annotated functions, but not when to use it.
Is there a heuristic for deciding whether I should at least try to annotate a function with tf.function? It seems that there are no reasons not to do it, unless I am too lazy to remove side effects or change some things like range() -> tf.range(). But if I am willing to do this...
Is there any reason not to use @tf.function for all functions?
TLDR: It depends on your function and whether you are in production or development. Don't use tf.function if you want to be able to debug your function easily, or if it falls under the limitations of AutoGraph or tf.v1 code compatibility.
I would highly recommend watching the Inside TensorFlow talks about AutoGraph and Functions, not Sessions.
In the following I'll break down the reasons, which are all taken from information made available online by Google.
In general, the tf.function decorator causes a function to be compiled as a callable that executes a TensorFlow graph. This entails:
Conversion of the code through AutoGraph if required (including any functions called from an annotated function)
Tracing and executing the generated graph code
There is detailed information available on the design ideas behind this.
Benefits of decorating a function with tf.function
General benefits
Faster execution, especially if the function consists of many small ops (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
If you want to use AutoGraph, using tf.function is highly recommended over calling AutoGraph directly.
Reasons for this include: Automatic control dependencies, it is required for some APIs, more caching, and exception helpers (Source).
Drawbacks of decorating a function with tf.function
General drawbacks
If the function only consists of a few expensive ops, there will not be much speedup (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
No exception catching (should be done in eager mode; outside of the decorated function) (Source)
Debugging is much harder
Limitations due to hidden side effects and TF control flow
Detailed information on AutoGraph limitations is available.
For functions with tf.v1 code
It is not allowed to create variables more than once in tf.function, but this is subject to change as tf.v1 code is phased out (Source)
For functions with tf.v2 code
No specific drawbacks
Examples of limitations
Creating variables more than once
It is not allowed to create variables more than once, such as v in the following example:
@tf.function
def f(x):
    v = tf.Variable(1)
    return tf.add(x, v)

f(tf.constant(2))
# => ValueError: tf.function-decorated function tried to create variables on non-first call.
In the following code, this is mitigated by making sure that self.v is only created once:
class C(object):
    def __init__(self):
        self.v = None

    @tf.function
    def f(self, x):
        if self.v is None:
            self.v = tf.Variable(1)
        return tf.add(x, self.v)

c = C()
print(c.f(tf.constant(2)))
# => tf.Tensor(3, shape=(), dtype=int32)
Hidden side effects not captured by AutoGraph
Mutations hidden in called functions, such as the change to self.a in this example, are not captured by AutoGraph, which leads to an error since cross-function analysis is not done (yet) (Source):
class C(object):
    def change_state(self):
        self.a += 1

    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.change_state()  # Mutation of self.a is hidden
        tf.print(self.a)

x = C()
x.f()
# => InaccessibleTensorError: The tensor 'Tensor("add:0", shape=(), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=cond_true_5, id=5477800528); accessed from: FuncGraph(name=f, id=5476093776).
Changes in plain sight are no problem:
class C(object):
    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.a += 1  # Mutation of self.a is in plain sight
        tf.print(self.a)

x = C()
x.f()
# => 1
Example of limitation due to TF control flow
This if statement leads to an error because the value for else needs to be defined for TF control flow:
@tf.function
def f(a, b):
    if tf.greater(a, b):
        return tf.constant(1)
    # If a <= b would return None

x = f(tf.constant(3), tf.constant(2))
# => ValueError: A value must also be returned from the else branch. If a value is returned from one branch of a conditional a value must be returned from all branches.
tf.function is useful for creating and using computational graphs; it should be used in training and in deployment, but it isn't needed for most of your functions.
Let's say we are building a special layer that will be a part of a larger model. We would not want the tf.function decorator on the function that constructs that layer, because it is merely a definition of what the layer will look like.
On the other hand, let's say we are going to make a prediction or continue our training using some function. We would want the tf.function decorator, because we are actually using the computational graph to get some value.
A great example would be constructing an encoder-decoder model.
DON'T put the decorator around the function that creates the encoder or decoder or any layer; that is only a definition of what it will do.
DO put the decorator around the "train" or "predict" method, because those actually use the computational graph for computation (see the sketch below).
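A minimal sketch of that split (the build_encoder and train_step names, and the toy objective, are illustrative): the builder function that merely defines layers stays undecorated, while the step that actually runs the computation gets @tf.function.

import tensorflow as tf

def build_encoder(latent_dim):
    # just a definition of the layer stack; no need for @tf.function here
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(latent_dim),
    ])

encoder = build_encoder(latent_dim=2)
optimizer = tf.keras.optimizers.Adam()

@tf.function  # this actually executes the computation, so decorate it
def train_step(x):
    with tf.GradientTape() as tape:
        z = encoder(x)
        loss = tf.reduce_mean(tf.square(z))  # toy objective
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss

print(train_step(tf.random.normal((32, 10))))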
Per my understanding and according to the documentation, using tf.function is highly recommended mainly for speeding up your code, since the code wrapped by tf.function is converted to a graph and therefore there is room for some optimizations (e.g. op pruning, folding, etc.) which may not be performed when the same code is run eagerly.
However, there are also a few cases where using tf.function might incur additional overhead or does not result in noticeable speedups. One notable case is when the wrapped function is small and only used a few times in your code and therefore the overhead of calling the graph might be relatively large. Another case is when most of the computations are already done on an accelerator device (e.g. GPU, TPU), and therefore the speedups gained by graph computation might not be significant.
There is also a section in the documentation where the speedups are discussed in various scenarios, and at the beginning of this section the two cases above have been mentioned:
Just wrapping a tensor-using function in tf.function does not automatically speed up your code. For small functions called a few times on a single machine, the overhead of calling a graph or graph fragment may dominate runtime. Also, if most of the computation was already happening on an accelerator, such as stacks of GPU-heavy convolutions, the graph speedup won't be large.
For complicated computations, graphs can provide a significant speedup. This is because graphs reduce the Python-to-device communication and perform some speedups.
But at the end of the day, if it's applicable to your workflow, I think the best way to determine this for your specific use case and environment is to profile your code when it gets executed in eager mode (i.e. without using tf.function) vs. when it gets executed in graph mode (i.e. using tf.function extensively).
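One rough way to do that comparison (the many_small_ops function is only a toy example) is to time the same function eagerly and wrapped in tf.function, after a warm-up call so the tracing cost is excluded:

import timeit
import tensorflow as tf

def many_small_ops(x):
    # lots of small ops: the kind of function where graph mode tends to help
    for _ in range(100):
        x = x + 0.1 * tf.sin(x)
    return x

graph_fn = tf.function(many_small_ops)
x = tf.random.normal((1000,))

graph_fn(x)  # warm-up call so the one-off tracing cost is not measured below

print("eager:", timeit.timeit(lambda: many_small_ops(x), number=100))
print("graph:", timeit.timeit(lambda: graph_fn(x), number=100))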

difficulty with imshow() numpy() and eager execution in tf2.0

I'm running tf2.0 in a conda environment, and would like to display a tensor in a figure.
plt.imshow(tmp)
TypeError: Image data of dtype object cannot be converted to float
tmp.dtype
tf.float32
So I tried converting it to a numpy array, but...
print(tmp.numpy())
AttributeError: 'Tensor' object has no attribute 'numpy'
tmp.eval()
ValueError: Cannot evaluate tensor using `eval()`: No default session is registered. Use `with sess.as_default()` or pass an explicit session to `eval(session=sess)`
I've read elsewhere that this is because I need an active session or eager execution. Eager execution should be enabled by default in tf2.0, but...
print(tf.__version__)
2.0.0-alpha0
tf.executing_eagerly()
False
tf.enable_eager_execution()
AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution'
tf.compat.v1.enable_eager_execution()
None
tf.executing_eagerly()
False
sess = tf.Session()
AttributeError: module 'tensorflow' has no attribute 'Session'
I tried upgrading to 2.0.0b1, but the results were exactly the same (except tf.__version__).
Edit:
according to this answer, the problems are probably because I am trying to debug a function which is inside a tf.data.Dataset.map() call, which works with static graphs. So perhaps the question becomes "how do I debug these functions?"
The critical insight for me was that running the tf.data.Dataset.map() function builds a graph, and the graph is executed later as part of a data pipeline. So it is more about code generation, and eager execution doesn't apply. Besides the lack of eager execution, building a graph has other restrictions, including that all inputs and outputs must be tensors. Tensors don't support item assignment operations such as T[0] += 1.
Item assignment is a fairly common use case, so there is a straightforward solution: tf.py_function (previously tf.py_func). py_function works with numpy arrays as inputs and outputs, so you're free to make use of other numpy functions which have not yet been included in the tensorflow library.
As usual, there is a trade-off: a py_function is interpreted on the fly by the python interpreter. So it won't be as fast as pre-compiled tensor operations. More importantly, the interpreter threads are not aware of each other, so there may be parallelisation issues.
There's a helpful explanation and demonstration of a py_function in the documentation: https://www.tensorflow.org/beta/guide/data
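A small sketch of the pattern (increment_first is an illustrative name, not from the original post): the numpy-style logic, including item assignment, lives inside a function wrapped by tf.py_function and used within Dataset.map().

import tensorflow as tf

def increment_first(t):
    # inside py_function the argument is an eager tensor, so .numpy() works
    arr = t.numpy().copy()
    arr[0] += 1  # item assignment is fine on a numpy array
    return arr

dataset = tf.data.Dataset.from_tensor_slices([[1.0, 2.0], [3.0, 4.0]])
dataset = dataset.map(
    lambda t: tf.py_function(func=increment_first, inp=[t], Tout=tf.float32))

for item in dataset:
    print(item.numpy())  # iterating eagerly outside the pipeline, so .numpy() works here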

Is it a TensorFlow best practice for loss functions to be callable (in the form of a function)? Other advantages besides Eager Execution compatibility?

Eager Execution requires any loss passed to any optimizer to be callable, i.e., in the form of a function.
So this is OK
def loss_function():
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=averaged_embeds, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))
but this is NOT ok
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=averaged_embeds, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))
And will raise this error
`loss` passed to Optimizer.compute_gradients should be a function when eager execution is enabled.
I've noticed that when a new TensorFlow feature requires some sort of practice, there are usually multiple benefits associated with that requirement even when not using that feature. For example, eager execution also requires get_variable to define variables. And as far as I can tell, there is no reason to use Variable over get_variable.
So are there any other advantages of having a loss function be callable outside of using Eager Execution?
On the loss issue (Akshay summed up the variable issue in the question comments):
When graph building, Tensors can behave like futures, and this is the way the Optimizer methods use them. This future can then be evaluated many times in the training loop, and the evaluation monitored to produce the gradient (which works by adding a bunch of ops to the graph that look at intermediate values).
Tensors are literal values when executing eagerly, so to get the same behavior the loss needs to be wrapped in a function (which is then run with a tf.GradientTape active to trace the gradient).
As for whether it's a best practice, new TensorFlow code being graph/eager agnostic is a good thing, although writing agnostic training loops is often more effort than it's worth. In general using Tensors as futures can be a bit harder to reason about.
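A minimal TF 2.x-style sketch of two equivalent ways to use a callable loss (w, x, y and loss_fn are illustrative placeholders, not from the original post):

import tensorflow as tf

w = tf.Variable([1.0, 2.0])
x = tf.constant([3.0, 4.0])
y = tf.constant(10.0)

def loss_fn():
    # re-evaluated on every step so the tape can trace it
    return tf.reduce_mean(tf.square(tf.reduce_sum(w * x) - y))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Option 1: pass the callable; the optimizer evaluates it under a tape internally.
optimizer.minimize(loss_fn, var_list=[w])

# Option 2: an explicit GradientTape, equivalent in effect.
with tf.GradientTape() as tape:
    loss = loss_fn()
grads = tape.gradient(loss, [w])
optimizer.apply_gradients(zip(grads, [w]))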