How to initialize the model with certain weights? - tensorflow2.0

I am using the example "stateful_clients" in tensorflow-federated examples. I want to use my pretrained model weights to initialize the model. I use the function model.load_weights(init_weight). But it seems that it doesn't work. The validation accuracy in the first round is still low. How can I solve the problem?
def tff_model_fn():
  """Constructs a fully initialized model for use in federated averaging."""
  keras_model = get_five_layers_cnn([28, 28, 1])
  keras_model.load_weights(init_weight)
  loss = tf.keras.losses.SparseCategoricalCrossentropy()
  return stateful_fedavg_tf.KerasModelWrapper(keras_model,
                                              test_data.element_spec, loss)

A quick primer on state and model weights in TFF
TFF takes a distinct perspective on state in machine learning, generally a consequence of its desire to be purely functional.
Usually in machine learning, a model is conceptually a function which takes data and produces a prediction. However, this notion is a little overloaded at times; does 'model' refer to a trained model (fitting the specification above), or an architecture which is parameterized by its parameters, and therefore needs to accept these parameters as an argument to be considered truly a 'function'? A conception somewhat in the middle is that of a 'stateful function', which I think tends to be what people intend to refer to when they use the term 'model'.
TFF standardizes on the parameterized understanding: for TFF, a 'model' is a function which accepts parameters along with data as arguments, producing a prediction. This is generally to avoid the notion of a stateful function, which is disallowed by a purely functional perspective (f(x) == f(x) should always hold, so f cannot carry any state which affects its output).
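For intuition, here is a minimal sketch (plain TensorFlow, not the TFF API; all names are illustrative) of that parameterized view, where the 'model' is a pure function of weights and data:
import tensorflow as tf

# The 'model' takes its parameters explicitly and holds no hidden state,
# so the same (weights, x) always yields the same prediction.
def predict(weights, x):
    w, b = weights
    return tf.nn.softmax(tf.matmul(x, w) + b)

w = tf.zeros([784, 10])
b = tf.zeros([10])
x = tf.ones([1, 784])
print(predict((w, b), x))  # deterministic given (weights, x)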
On the code in question
I'm not super familiar with this portion of the TFF codebase; in particular I'm a little surprised at the behavior of the keras model wrapper, as usually TFF wants to serialize all logic into TFF-defined data structures as soon as possible (at least, this is how I think about it). Glancing at the code, it looks to me like it could work, but there have been exciting interactions between TFF and Keras in the past.
Briefly, here is how this path should be working:
The model function you define above is invoked while building the initialize computation, in a graph context; the logic to load weights (or assignment of the weights themselves, baked into the graph as a constant) would hopefully be serialized into the graph that TFF generates to represent initialize.
Upon calling iterative_process.initialize, you would find your desired weights populated in the appropriate attributes of the returned data structure. This would serve as your initial starting point for your iterative process, and you would be off to the races.
What I am suspicious of in the above is that TFF will silently invoke your model_fn in a TensorFlow graph context, resulting in non-program-order semantics; if there is no control dependency between the assignment and the return value of your function (and there isn't one in the code above; in fact it is not obvious how to force one), the assignment may be skipped at initialize time. Therefore the state returned from initialize won't have your specified weights.
If this suspicion is true, the appropriate solution is to run the weight-loading logic directly in Python. TFF provides some utilities to help with this kind of thing, like tff.learning.state_with_new_model_weights. This would be used like:
state = iterative_process.initialize()
# Load the pretrained weights in plain Python (exact loading code depends on
# how the weights were saved), e.g. via a Keras model:
keras_model = get_five_layers_cnn([28, 28, 1])
keras_model.load_weights(init_weight)
state_with_loaded_weights = tff.learning.state_with_new_model_weights(
    state, keras_model.get_weights())  # check the exact signature; it may take
                                       # trainable/non-trainable weights separately
# continue on using state_with_loaded_weights in the iterative process

Related

Should I use @tf.function for all functions?

An official tutorial on @tf.function says:
To get peak performance and to make your model deployable anywhere, use tf.function to make graphs out of your programs. Thanks to AutoGraph, a surprising amount of Python code just works with tf.function, but there are still pitfalls to be wary of.
The main takeaways and recommendations are:
Don't rely on Python side effects like object mutation or list appends.
tf.function works best with TensorFlow ops, rather than NumPy ops or Python primitives.
When in doubt, use the for x in y idiom.
It only mentions how to implement @tf.function-annotated functions, but not when to use it.
Is there a heuristic on how to decide whether I should at least try to annotate a function with tf.function? It seems that there are no reasons not to do it, unless I am too lazy to remove side effects or change some things like range() -> tf.range(). But if I am willing to do this...
Is there any reason not to use @tf.function for all functions?
TLDR: It depends on your function and whether you are in production or development. Don't use tf.function if you want to be able to debug your function easily, or if it falls under the limitations of AutoGraph or tf.v1 code compatibility.
I would highly recommend watching the Inside TensorFlow talks about AutoGraph and Functions, not Sessions.
In the following I'll break down the reasons, which are all taken from information made available online by Google.
In general, the tf.function decorator causes a function to be compiled as a callable that executes a TensorFlow graph. This entails:
Conversion of the code through AutoGraph if required (including any functions called from an annotated function)
Tracing and executing the generated graph code
There is detailed information available on the design ideas behind this.
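As a rough illustration of that compile-and-trace behaviour (a minimal sketch, not taken from the tutorial), the Python body below executes once during tracing, and subsequent calls with the same input signature reuse the generated graph:
import tensorflow as tf

@tf.function
def scaled_sum(x, y):
    print("Tracing!")  # Python side effect: runs only while the function is traced
    return tf.reduce_sum(x) * y

a = tf.constant([1.0, 2.0, 3.0])
print(scaled_sum(a, tf.constant(2.0)))  # traces, then executes the graph
print(scaled_sum(a, tf.constant(3.0)))  # same signature: graph reused, no "Tracing!"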
Benefits of decorating a function with tf.function
General benefits
Faster execution, especially if the function consists of many small ops (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
If you want to use AutoGraph, using tf.function is highly recommended over calling AutoGraph directly.
Reasons for this include: Automatic control dependencies, it is required for some APIs, more caching, and exception helpers (Source).
Drawbacks of decorating a function with tf.function
General drawbacks
If the function only consists of a few expensive ops, there will not be much speedup (Source)
For functions with Python code / Using AutoGraph via tf.function decoration
No exception catching (should be done in eager mode; outside of the decorated function) (Source)
Debugging is much harder
Limitations due to hidden side effects and TF control flow
Detailed information on AutoGraph limitations is available.
For functions with tf.v1 code
It is not allowed to create variables more than once in tf.function, but this is subject to change as tf.v1 code is phased out (Source)
For functions with tf.v2 code
No specific drawbacks
Examples of limitations
Creating variables more than once
It is not allowed to create variables more than once, such as v in the following example:
@tf.function
def f(x):
    v = tf.Variable(1)
    return tf.add(x, v)

f(tf.constant(2))
# => ValueError: tf.function-decorated function tried to create variables on non-first call.
In the following code, this is mitigated by making sure that self.v is only created once:
class C(object):
    def __init__(self):
        self.v = None

    @tf.function
    def f(self, x):
        if self.v is None:
            self.v = tf.Variable(1)
        return tf.add(x, self.v)

c = C()
print(c.f(tf.constant(2)))
# => tf.Tensor(3, shape=(), dtype=int32)
Hidden side effects not captured by AutoGraph
Mutations such as the change to self.a inside change_state in this example are hidden from AutoGraph, which leads to an error since cross-function analysis is not done (yet) (Source):
class C(object):
    def change_state(self):
        self.a += 1

    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.change_state()  # Mutation of self.a is hidden
        tf.print(self.a)

x = C()
x.f()
# => InaccessibleTensorError: The tensor 'Tensor("add:0", shape=(), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=cond_true_5, id=5477800528); accessed from: FuncGraph(name=f, id=5476093776).
Changes in plain sight are no problem:
class C(object):
    @tf.function
    def f(self):
        self.a = tf.constant(0)
        if tf.constant(True):
            self.a += 1  # Mutation of self.a is in plain sight
        tf.print(self.a)

x = C()
x.f()
# => 1
Example of limitation due to TF control flow
This if statement leads to an error because a value for the else branch needs to be defined for TF control flow:
@tf.function
def f(a, b):
    if tf.greater(a, b):
        return tf.constant(1)
    # If a <= b, None would be returned

x = f(tf.constant(3), tf.constant(2))
# => ValueError: A value must also be returned from the else branch. If a value is returned from one branch of a conditional a value must be returned from all branches.
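A minimal way to fix this particular example (my own sketch) is to make sure that every branch returns a value, so that the generated TF conditional has matching outputs:
@tf.function
def f(a, b):
    if tf.greater(a, b):
        return tf.constant(1)
    else:
        return tf.constant(0)

x = f(tf.constant(3), tf.constant(2))
# => a scalar int32 tensor with value 1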
tf.function is useful for creating and using computational graphs; it should be used in training and in deployment, but it isn't needed for most of your functions.
Let's say that we are building a special layer that will be a part of a larger model. We would not want to put the tf.function decorator above the function that constructs that layer, because it is merely a definition of what the layer will look like.
On the other hand, let's say that we are going to either make a prediction or continue our training using some function. We would want to have the tf.function decorator because we are actually using the computational graph to get some value.
A great example would be constructing an encoder-decoder model.
DON'T put the decorator around the function that creates the encoder or the decoder or any layer; that is only a definition of what it will do.
DO put the decorator around the "train" or "predict" methods because those are actually going to use the computational graph for computation.
Per my understanding and according to the documentation, using tf.function is highly recommended mainly for speeding up your code, since the code wrapped by tf.function is converted to a graph, leaving room for optimizations (e.g. op pruning, folding, etc.) that may not be performed when the same code is run eagerly.
However, there are also a few cases where using tf.function might incur additional overhead or does not result in noticeable speedups. One notable case is when the wrapped function is small and only used a few times in your code and therefore the overhead of calling the graph might be relatively large. Another case is when most of the computations are already done on an accelerator device (e.g. GPU, TPU), and therefore the speedups gained by graph computation might not be significant.
There is also a section in the documentation where the speedups are discussed in various scenarios, and at the beginning of this section the two cases above have been mentioned:
Just wrapping a tensor-using function in tf.function does not automatically speed up your code. For small functions called a few times on a single machine, the overhead of calling a graph or graph fragment may dominate runtime. Also, if most of the computation was already happening on an accelerator, such as stacks of GPU-heavy convolutions, the graph speedup won't be large.
For complicated computations, graphs can provide a significant speedup. This is because graphs reduce the Python-to-device communication and perform some speedups.
But at the end of the day, if it's applicable to your workflow, I think the best way to determine this for your specific use case and environment is to profile your code when it gets executed in eager mode (i.e. without using tf.function) vs. when it gets executed in graph mode (i.e. using tf.function extensively).
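For example, a rough (and hardware-dependent) way to do that comparison is to time the same function run eagerly and wrapped in tf.function; the function below is purely illustrative:
import timeit
import tensorflow as tf

def many_small_ops(x):
    # Many cheap ops: the case where graph execution tends to help most.
    for _ in range(100):
        x = x + 1.0
    return x

graph_fn = tf.function(many_small_ops)
x = tf.constant(0.0)
graph_fn(x)  # warm-up call so tracing time is excluded from the timing below

print("eager:", timeit.timeit(lambda: many_small_ops(x), number=100))
print("graph:", timeit.timeit(lambda: graph_fn(x), number=100))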

Tensorflow warning: two cells provided to MultiRNNCell are the same object

I have been consistently receiving the following warning while executing tensorflow scripts
WARNING:tensorflow:At least two cells provided to MultiRNNCell are the
same object and will share weights.
lstm_layer=rnn.LSTMBlockCell(num_units,forget_bias=1)
lstm_layer=rnn.DropoutWrapper(lstm_layer, output_keep_prob=output_keep_prob)
stacked_lstm = rnn.MultiRNNCell([lstm_layer] * num_layers)
outputs,_=rnn.static_rnn(stacked_lstm,input,dtype="float32")
However, the RNNs in question appear to be running fine, and are making accurate predictions.
What are the implications in relation to the warning message? Can it be safely ignored? If it is potential serious, how might its impact be evaluated?
Using [lstm_layer] * num_layers creates multiple RNN layers that all refer to the same object in Python. Some versions of tensorflow only warn about this usage, while others report an error.
As the warning says, since all RNN layers are the same object, they share their weights: the gradients from every layer are applied to that single set of weights. This is equivalent to reducing the number of parameters and the complexity of the model.
If you want to create multiple distinct RNN layers and a more complex model, you can use the following approach. Which of the two methods is more effective depends on the specific application scenario and results; if your model's results are already good enough, a more complex model doesn't make much sense.
rnn_layers = []
for _ in range(num_layers):
    lstm_layer = rnn.LSTMBlockCell(num_units, forget_bias=1)
    lstm_layer = rnn.DropoutWrapper(lstm_layer, output_keep_prob=output_keep_prob)
    rnn_layers.append(lstm_layer)
stacked_lstm = rnn.MultiRNNCell(rnn_layers)

Is it a TensorFlow Best Practice for loss functions be callable (in the form of a function)? Other advantages besides Eager Execution compatibility?

Eager Execution requires any loss passed to any optimizer to be callable, i.e., in the form of a function.
So this is OK
def loss_function():
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=averaged_embeds, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))
but this is NOT ok
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=averaged_embeds, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))
And will raise this error
`loss` passed to Optimizer.compute_gradients should be a function when eager execution is enabled.
I've noticed that when a new Tensorflow feature requires some sort of practice, there are usually multiple benefits associated with that requirement, even when not using that feature. For example, Eager Execution also requires get_variable to define variables, and as far as I can tell there is no reason to use Variable over get_variable.
So are there any other advantages of having a loss function be callable outside of using Eager Execution?
On the loss issue (Akshay summed up the variable issue in the question comments):
When graph building, Tensors can behave like futures, and this is the way the Optimizer methods use them. This future can then be evaluated many times in the training loop, and the evaluation monitored to produce the gradient (which works by adding a bunch of ops to the graph which look at intermediate values).
Tensors are literal values when executing eagerly, so to get the same behavior the loss needs to be wrapped in a function (which is then run with a tf.GradientTape active to trace the gradient).
As for whether it's a best practice, new TensorFlow code being graph/eager agnostic is a good thing, although writing agnostic training loops is often more effort than it's worth. In general using Tensors as futures can be a bit harder to reason about.
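To make that concrete, here is a minimal sketch of a callable loss used eagerly (a toy linear model with the tf.keras optimizer API; none of these names come from the question's code):
import tensorflow as tf

w = tf.Variable(2.0)
b = tf.Variable(0.5)
xs = tf.constant([1.0, 2.0, 3.0])
ys = tf.constant([3.0, 5.0, 7.0])

def loss_fn():
    # Re-evaluated on each call against the current variable values.
    preds = w * xs + b
    return tf.reduce_mean(tf.square(preds - ys))

opt = tf.keras.optimizers.SGD(learning_rate=0.1)

# Explicit tape: record the forward pass, then take gradients of the result.
with tf.GradientTape() as tape:
    loss = loss_fn()
grads = tape.gradient(loss, [w, b])
opt.apply_gradients(zip(grads, [w, b]))

# Or let the optimizer call the function itself under its own tape.
opt.minimize(loss_fn, var_list=[w, b])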

How to get model output from tensorflow model without knowing its name in advance?

So I frequently run models with different architectures, but have code intended to apply to all of them which runs inference off the saved models. Thus, I will be calling eval() on the last layer of this model, like this:
yhat = graph.get_tensor_by_name("name_of_my_last_layer:0")
decoded_image = yhat.eval(session=sess, feed_dict={x : X})
However, without arduous log parsing, I don't know exactly what the last layer is named, and I'm currently hand-coding it. I've considered creating a generic 'output' tensor in my graph but that seems wasteful/brittle. What is the better way?
The best way is either to make the layer you want to analyse a model output, or to fix its name (by passing the name= keyword argument to the layer function when creating the layer) so that it is a known string.
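As a sketch of the second option (the model and the name "final_output" below are invented for illustration): give the final layer a fixed name when building the model, so generic inference code can always look it up by that known string.
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, name="final_output")(x)  # fixed, known name
model = tf.keras.Model(inputs, outputs)

# Generic code can now find the layer by its known name, without log parsing;
# in a graph/session setting the corresponding tensor names are prefixed with it.
layer = model.get_layer("final_output")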

Tensorflow RNN input size

I am trying to use tensorflow to create a recurrent neural network. My code is something like this:
import tensorflow as tf
rnn_cell = tf.nn.rnn_cell.GRUCell(3)
inputs = [tf.constant([[0, 1]], dtype=tf.float32), tf.constant([[2, 3]], dtype=tf.float32)]
outputs, end = tf.nn.rnn(rnn_cell, inputs, dtype=tf.float32)
Now, everything runs just fine. However, I am rather confused by what is actually going on. The output dimensions are always the batch size x the size of the rnn cell's hidden state - how can they be completely independent of the input size?
If my understanding is correct, the inputs are concatenated to the rnn's hidden state at each step, and then multiplied by a weight matrix (among other operations). This means that the dimensions of the weight matrix need to depend on the input size, which is impossible, because the rnn_cell is created before the inputs are even declared!
After seeing the answer to a question about tensorflow's GRU implementation, I've realized what's going on. Counter to my intuition, the GRUCell constructor doesn't create any weight or bias variables at all. Instead, it creates its own variable scope, and then instantiates the variables on demand when actually called. Tensorflow's variable scoping mechanism ensures that the variables are only created once, and shared across subsequent calls to the GRU.
I'm not sure why they decided to go with this rather confusing implementation, which, as far as I can tell, is undocumented. To me it seems more appropriate to use Python's object-level variable scoping to encapsulate the tensorflow variables within the GRUCell itself, rather than relying on an additional implicit scoping mechanism.
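A small sketch (TF 1.x-style API, matching the question's code; behaviour may differ in other versions) that makes the lazy creation visible: no weights exist until the cell is first called, at which point their shapes can depend on the now-known input size.
import tensorflow as tf

rnn_cell = tf.nn.rnn_cell.GRUCell(3)
print(len(tf.trainable_variables()))  # 0: the constructor created no weights

inputs = [tf.constant([[0, 1]], dtype=tf.float32),
          tf.constant([[2, 3]], dtype=tf.float32)]
outputs, end = tf.nn.rnn(rnn_cell, inputs, dtype=tf.float32)
print(len(tf.trainable_variables()))  # now non-zero: weights were created on the
                                      # first call, sized using the input width (2)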