Should I use @tf.function for all functions? - tensorflow

An official tutorial on @tf.function says:
To get peak performance and to make your model deployable anywhere,
use tf.function to make graphs out of your programs. Thanks to
AutoGraph, a surprising amount of Python code just works with
tf.function, but there are still pitfalls to be wary of.
The main takeaways and recommendations are:
Don't rely on Python side effects like object mutation or list appends.
tf.function works best with TensorFlow ops, rather than NumPy ops or Python primitives.
When in doubt, use the for x in y idiom.
It only mentions how to implement @tf.function-annotated functions, but not when to use it.
Is there a heuristic for deciding whether I should at least try to annotate a function with tf.function? It seems that there are no reasons not to do it, unless I am too lazy to remove side effects or change some things like range() -> tf.range(). But if I am willing to do this...
Is there any reason not to use @tf.function for all functions?

TLDR: It depends on your function and whether you are in production or development. Don't use tf.function if you want to be able to debug your function easily, or if it falls under the limitations of AutoGraph or tf.v1 code compatibility.
I would highly recommend watching the Inside TensorFlow talks about AutoGraph and Functions, not Sessions.
In the following I'll break down the reasons, which are all taken from information made available online by Google.
In general, the tf.function decorator causes a function to be compiled as a callable that executes a TensorFlow graph. This entails:
Conversion of the code through AutoGraph if required (including any functions called from an annotated function)
Tracing and executing the generated graph code
There is detailed information available on the design ideas behind this.
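As a small illustration of the trace-then-execute behaviour (a minimal sketch along the lines of the official guide; the function and inputs are arbitrary), a plain Python side effect such as print only runs while the function is being traced, not on later calls with a matching input signature:
import tensorflow as tf

@tf.function
def double(a):
    print("Tracing with", a)  # Python side effect: runs only during tracing
    return a + a

double(tf.constant(1))    # traced for int32 scalars, then the graph is executed
double(tf.constant(2))    # same input signature: cached graph is reused, no print
double(tf.constant("a"))  # new dtype: traced (and printed) again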
Benefits of decorating a function with tf.function
General benefits
Faster execution, especially if the function consists of many small ops (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
If you want to use AutoGraph, using tf.function is highly recommended over calling AutoGraph directly.
Reasons for this include automatic control dependencies, the fact that it is required by some APIs, more caching, and exception helpers (Source).
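To give a sense of the automatic-control-dependencies point, here is a minimal sketch (the variable and values are arbitrary): inside tf.function, stateful ops such as variable assignments keep their program order without a manual tf.control_dependencies block.
import tensorflow as tf

v = tf.Variable(1.0)

@tf.function
def assign_then_read():
    v.assign_add(1.0)      # stateful op
    return v.read_value()  # automatic control dependencies make the read
                           # happen after the assignment, as in program order

print(assign_then_read())  # => 2.0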
Drawbacks of decorating a function with tf.function
General drawbacks
If the function consists of only a few expensive ops, there will not be much speedup (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
No exception catching (should be done in eager mode; outside of the decorated function) (Source)
Debugging is much harder
Limitations due to hidden side effects and TF control flow
Detailed information on AutoGraph limitations is available.
For functions with tf.v1 code
It is not allowed to create variables more than once in tf.function, but this is subject to change as tf.v1 code is phased out (Source)
For functions with tf.v2 code
No specific drawbacks
Examples of limitations
Creating variables more than once
It is not allowed to create variables more than once, such as v in the following example:
@tf.function
def f(x):
    v = tf.Variable(1)
    return tf.add(x, v)

f(tf.constant(2))
# => ValueError: tf.function-decorated function tried to create variables on non-first call.
In the following code, this is mitigated by making sure that self.v is only created once:
class C(object):
    def __init__(self):
        self.v = None

    @tf.function
    def f(self, x):
        if self.v is None:
            self.v = tf.Variable(1)
        return tf.add(x, self.v)

c = C()
print(c.f(tf.constant(2)))
# => tf.Tensor(3, shape=(), dtype=int32)
Hidden side effects not captured by AutoGraph
Changes such as the mutation of self.a inside change_state in this example are hidden from AutoGraph, which leads to an error since cross-function analysis is not done (yet) (Source):
class C(object):
    def change_state(self):
        self.a += 1

    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.change_state()  # Mutation of self.a is hidden
        tf.print(self.a)

x = C()
x.f()
# => InaccessibleTensorError: The tensor 'Tensor("add:0", shape=(), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=cond_true_5, id=5477800528); accessed from: FuncGraph(name=f, id=5476093776).
Changes in plain sight are no problem:
class C(object):
    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.a += 1  # Mutation of self.a is in plain sight
        tf.print(self.a)

x = C()
x.f()
# => 1
Example of limitation due to TF control flow
This if statement leads to an error because the value for else needs to be defined for TF control flow:
@tf.function
def f(a, b):
    if tf.greater(a, b):
        return tf.constant(1)
    # If a <= b, the function would return None

x = f(tf.constant(3), tf.constant(2))
# => ValueError: A value must also be returned from the else branch. If a value is returned from one branch of a conditional a value must be returned from all branches.
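A minimal fix, staying with the same example, is to return a value on every branch so TF control flow can build both sides of the conditional:
@tf.function
def f(a, b):
    if tf.greater(a, b):
        return tf.constant(1)
    return tf.constant(0)  # explicit value for the else path

x = f(tf.constant(3), tf.constant(2))
# => tf.Tensor(1, shape=(), dtype=int32)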

tf.function is useful for creating and using computational graphs; it should be used in training and in deployment. However, it isn't needed for most of your functions.
Let's say that we are building a special layer that will be a part of a larger model. We would not want to have the tf.function decorator above the function that constructs that layer, because it is merely a definition of what the layer will look like.
On the other hand, let's say that we are going to either make a prediction or continue our training using some function. We would want to have the decorator tf.function because we are actually using the computational graph to get some value.
A great example would be constructing an encoder-decoder model.
DON'T put the decorator around the function that creates the encoder or decoder or any layer; that is only a definition of what it will do.
DO put the decorator around the "train" or "predict" method, because those are actually going to use the computational graph for computation.
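As a rough sketch of this split (the model, optimizer, and loss below are placeholder choices for illustration, not anything prescribed here), the layer-building code stays plain Python and only the step that actually runs the computation is compiled:
import tensorflow as tf

# Plain Python construction: no tf.function needed here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # the step that exercises the computational graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss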

Per my understanding, and according to the documentation, using tf.function is highly recommended mainly for speeding up your code, since the code wrapped by tf.function is converted to a graph, leaving room for optimizations (e.g. op pruning, folding, etc.) that may not be performed when the same code is run eagerly.
However, there are also a few cases where using tf.function might incur additional overhead or does not result in noticeable speedups. One notable case is when the wrapped function is small and only used a few times in your code and therefore the overhead of calling the graph might be relatively large. Another case is when most of the computations are already done on an accelerator device (e.g. GPU, TPU), and therefore the speedups gained by graph computation might not be significant.
There is also a section in the documentation where the speedups are discussed in various scenarios, and at the beginning of this section the two cases above have been mentioned:
Just wrapping a tensor-using function in tf.function does not automatically speed up your code. For small functions called a few times on a single machine, the overhead of calling a graph or graph fragment may dominate runtime. Also, if most of the computation was already happening on an accelerator, such as stacks of GPU-heavy convolutions, the graph speedup won't be large.
For complicated computations, graphs can provide a significant speedup. This is because graphs reduce the Python-to-device communication and perform some speedups.
But at the end of the day, if it's applicable to your workflow, I think the best way to determine this for your specific use case and environment is to profile your code when it gets executed in eager mode (i.e. without using tf.function) vs. when it gets executed in graph mode (i.e. using tf.function extensively).
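A minimal profiling sketch along those lines (the toy function and repeat counts are arbitrary; real measurements should of course use your own workload):
import timeit
import tensorflow as tf

def many_small_ops(x):
    for _ in range(100):
        x = x * 1.0 + 0.1  # lots of tiny ops: the case where graphs help most
    return x

graph_fn = tf.function(many_small_ops)  # same logic, compiled to a graph
x = tf.constant(1.0)

graph_fn(x)  # warm-up call so tracing time is not included below

print("eager:", timeit.timeit(lambda: many_small_ops(x), number=200))
print("graph:", timeit.timeit(lambda: graph_fn(x), number=200))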

Related

Is there a PyTorch equivalent of tf.custom_gradient()?

I am new to PyTorch but have a lot of experience with TensorFlow.
I would like to modify the gradient of just a tiny piece of the graph: just the derivative of the activation function of a single layer. This can be easily done in TensorFlow using tf.custom_gradient, which allows you to supply a customized gradient for any function.
I would like to do the same thing in PyTorch. I know that you can modify the backward() method, but that requires you to rewrite the derivative for the whole network defined in the forward() method, whereas I would just like to modify the gradient of a tiny piece of the graph. Is there something like tf.custom_gradient() in PyTorch? Thanks!
You can do this in two ways:
1. Modifying the backward() function:
As you already said in your question, pytorch also allows you to provide a custom backward implementation. However, in contrast to what you wrote, you do not need to re-write the backward() of the entire model - only the backward() of the specific layer you want to change.
Here's a simple and nice tutorial that shows how this can be done.
For example, here is a custom clip activation that instead of killing the gradients outside the [0, 1] domain, simply passes the gradients as-is:
class MyClip(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.clip(x, 0., 1.)

    @staticmethod
    def backward(ctx, grad):
        return grad
Now you can use the MyClip layer wherever you like in your model, and you do not need to worry about the overall backward function. For example:
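Assuming some input tensor x, the custom function is invoked through apply():
x = torch.randn(4, requires_grad=True)
y = MyClip.apply(x)   # forward pass clips to [0, 1]
y.sum().backward()    # backward pass uses the custom gradient (passes it through)
print(x.grad)         # all ones, even where x was outside [0, 1]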
2. Using a backward hook
pytorch allows you to attach hooks to different layers (= sub nn.Modules) of your network. You can register_full_backward_hook on your layer. That hook function can modify the gradients:
The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations.
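A small sketch of that approach (the layer sizes and the 0.5 scaling are arbitrary illustrations, not part of the hook API):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))

def scale_grad_input(module, grad_input, grad_output):
    # Return a replacement for grad_input; None entries are left unchanged.
    return tuple(g * 0.5 if g is not None else None for g in grad_input)

# Attach the hook only to the layer whose gradient should be modified.
handle = model[0].register_full_backward_hook(scale_grad_input)

x = torch.randn(2, 4, requires_grad=True)
model(x).sum().backward()
handle.remove()  # detach the hook when it is no longer needed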

How to initialize the model with certain weights?

I am using the example "stateful_clients" in tensorflow-federated examples. I want to use my pretrained model weights to initialize the model. I use the function model.load_weights(init_weight). But it seems that it doesn't work. The validation accuracy in the first round is still low. How can I solve the problem?
def tff_model_fn():
    """Constructs a fully initialized model for use in federated averaging."""
    keras_model = get_five_layers_cnn([28, 28, 1])
    keras_model.load_weights(init_weight)
    loss = tf.keras.losses.SparseCategoricalCrossentropy()
    return stateful_fedavg_tf.KerasModelWrapper(keras_model,
                                                test_data.element_spec, loss)
A quick primer on state and model weights in TFF
TFF takes a distinct perspective on state in machine learning, generally a consequence of its desire to be purely functional.
Usually in machine learning, a model is conceptually a function which takes data and produces a prediction. However, this notion is a little overloaded at times; does 'model' refer to a trained model (fitting the specification above), or an architecture which is parameterized by its parameters, and therefore needs to accept these parameters as an argument to be considered truly a 'function'? A conception somewhat in the middle is that of a 'stateful function', which I think tends to be what people intend to refer to when they use the term 'model'.
TFF standardizes on the latter understanding. For TFF, a 'model' is a function which accepts parameters along with data as an argument, producing a prediction. This is generally to avoid the notion of a stateful function, which is disallowed by a purely functional perspective (f(x) == f(x) should always be true, so f cannot have any state which affects its output).
On the code in question
I'm not super familiar with this portion of the TFF codebase; in particular I'm a little surprised at the behavior of the keras model wrapper, as usually TFF wants to serialize all logic into TFF-defined data structures as soon as possible (at least, this is how I think about it). Glancing at the code, it looks to me like it could work--but there have been exciting interactions between TFF and Keras in the past.
Briefly, here is how this path should be working:
The model function you define above is invoked while building the initialize computation, in a graph context; the logic to load weights (or assignment of the weights themselves, baked into the graph as a constant) would hopefully be serialized into the graph that TFF generates to represent initialize.
Upon calling iterative_process.initialize, you would find your desired weights populated in the appropriate attributes of the returned data structure. This would serve as your initial starting point for your iterative process, and you would be off to the races.
What I am suspicious of in the above is step 1: TFF will silently invoke your model_fn in a TensorFlow graph context, resulting in non-program-order semantics; if there is no control dependency between the assignment and the return value of your function (which there isn't in the code above, and in fact it is not obvious how to force this), the assignment may be skipped at initialize time. Therefore the state returned from initialize won't have your specified weights.
If this suspicion is true, the appropriate solution is to run the weight-loading logic directly in Python. TFF provides some utilities to help with this kind of thing, like tff.learning.state_with_new_model_weights. This would be used like:
state = iterative_process.initialize()
weights = tf.keras.load_weights(...) # No idea if this call is correct, probably not.
state_with_loaded_weights = tff.learning.state_with_new_model_weights(state, weights)
...
# continue on using state in the iterative process

optimization in gpflow 2: Why set autograph=False?

In the current notebook tutorials (gpflow 2.0), all @tf.function tags include the option
autograph=False, e.g. (https://gpflow.readthedocs.io/en/2.0.0-rc1/notebooks/advanced/gps_for_big_data.html):
@tf.function(autograph=False)
def optimization_step(optimizer, model: gpflow.models.SVGP, batch):
    with tf.GradientTape(watch_accessed_variables=False) as tape:
        tape.watch(model.trainable_variables)
        objective = - model.elbo(*batch)
    grads = tape.gradient(objective, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return objective
Does anyone know why that is the case, or what the reasoning behind this is?
As far as I understood, autograph=True simply allows Python control flow to be translated to a graph structure. Does setting/leaving it to True, even if the functionality is not required, have any drawbacks?
My guess would have been that it's just a small overhead at graph-compile time, which should be negligible. Is that wrong?
Thanks
The reason we set autograph to False in most of the tf.function-wrapped objectives is that GPflow makes use of a multiple-dispatch Dispatcher which internally uses generators. TensorFlow, however, cannot deal with generator objects in autograph mode (see Capabilities and Limitations of AutoGraph), which leads to these warnings:
WARNING:tensorflow:Entity <bound method Dispatcher.dispatch_iter of <dispatched sample_conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING: Entity <bound method Dispatcher.dispatch_iter of <dispatched sample_conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING:tensorflow:Entity <bound method Dispatcher.dispatch_iter of <dispatched conditional>> appears to be a generator function. It will not be converted by AutoGraph.
WARNING: Entity <bound method Dispatcher.dispatch_iter of <dispatched conditional>> appears to be a generator function. It will not be converted by AutoGraph.
We've known about this issue for a while but haven't got around to actually fixing it - thanks for bringing this back to our attention. I've just created a PR which fixes this issue and no longer requires you to set autograph to False. I expect this PR to be merged fairly soon.

Is it a TensorFlow Best Practice for loss functions to be callable (in the form of a function)? Other advantages besides Eager Execution compatibility?

Eager Execution requires any loss passed to any optimizer to be callable, i.e., in the form of a function.
So this is OK
def loss_function():
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=averaged_embeds, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))
but this is NOT ok
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=averaged_embeds, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))
And will raise this error
`loss` passed to Optimizer.compute_gradients should be a function when eager execution is enabled.
I've noticed that when a new TensorFlow feature requires some sort of practice, there are usually multiple benefits associated with that requirement, even when not using that feature. For example, Eager Execution also requires get_variable to define variables. And as far as I can tell, there is no reason to use Variable over get_variable.
So are there any other advantages of having a loss function be callable outside of using Eager Execution?
On the loss issue (Akshay summed up the variable issue in the question comments):
When graph building, Tensors can behave like futures, and this is the way the Optimizer methods use them. This future can then be evaluated many times in the training loop, and the evaluation monitored to produce the gradient (which works by adding a bunch of ops to the graph which look at intermediate values).
Tensors are literal values when executing eagerly, so to get the same behavior it needs to be wrapped in a function (which is then run with a tf.GradientTape active to trace the gradient).
As for whether it's a best practice, new TensorFlow code being graph/eager agnostic is a good thing, although writing agnostic training loops is often more effort than it's worth. In general using Tensors as futures can be a bit harder to reason about.
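In TF2 the same pattern shows up with the Keras optimizers: in eager mode, minimize() takes a callable loss so it can re-evaluate it under a GradientTape. A toy sketch (the variable and loss here are made up for illustration):
import tensorflow as tf

w = tf.Variable(2.0)

def loss_fn():
    # Callable loss: re-evaluated (and traced) on every minimize() call.
    return tf.square(w - 5.0)

opt = tf.keras.optimizers.SGD(learning_rate=0.1)
for _ in range(50):
    opt.minimize(loss_fn, var_list=[w])
print(w.numpy())  # approaches 5.0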

When should I define a new TensorFlow op?

For my application, I was able to create a new function using only predefined ops. Is there any need to define a new op in this case?
The pseudocode for my function is:
def myGauss(arg, arg2):
    # Here I only used predefined TensorFlow operations
    ...

z1 = myGauss(arg, arg2)
If you can achieve what you set out to do with a composition of existing ops, then that's great! You don't need to create a new op.
There are circumstances when we've found it necessary to create a new op, however:
Sometimes you can gain performance by fusing ops together into a single op. For example many of the "training" ops have fused implementations, even though they were initially implemented using simple ops.
Another example is when you want to define a gradient for a composition of ops (because it's more efficient or stable to consider the expression as a whole). This is the rationale for ops like tf.nn.softmax_cross_entropy_with_logits().
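Note that for the gradient-of-a-composition case you can often avoid writing a new op altogether by using tf.custom_gradient, which keeps the composition of existing ops but lets you supply the gradient for the expression as a whole. The log1pexp example from the TensorFlow documentation illustrates the idea (this is a sketch of that documented example, not a new op):
import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad(upstream):
        # Gradient written for the expression as a whole: numerically stable
        # even for large x, unlike differentiating the naive composition.
        return upstream * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad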