Does TensorFlow support the Keras model fit() method with eager execution? - tensorflow

I am training a Keras model (tf.keras.models.Sequential) by calling its fit() method.
Since I enabled eager execution, training time (for the same number of epochs) went up from 20.1s to 49.4s. Training also no longer seemed to converge: the loss stayed around 9 (without eager execution it went down to 1), and fit() no longer reported the requested "accuracy" metric.
Is eager execution supported for Keras models? Note that I am calling fit() on the model, not using an estimator.
Here is the snippet of code that declares the model and does the training, using TF 1.7 for GPU installed with pip3.
tf.enable_eager_execution()

model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(11,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(11, activation='softmax')
])

optimizer = tf.train.AdamOptimizer()
# optimizer = 'adam'

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x=train_X, y=train_y, epochs=200, batch_size=64, verbose=2)
UPDATE: filed issue #18642 on the TensorFlow GitHub.

The issue I reported on the TensorFlow GitHub got this answer:
Thank you for the bug report. We have a fix for this issue, that will
show up on GitHub soon.
See issue #18642 on the TensorFlow GitHub.
Based on this, I understand that the fit() method of Keras models will be supported with eager execution once the bug is fixed.

Here is a quote from the TensorFlow site, found here:
There are many parameters to optimize when calculating derivatives. TensorFlow code is easier to read when structured into reusable classes and objects instead of a single top-level function. Eager execution encourages the use of the Keras-style layer classes in the tf.keras.layers module. Additionally, the tf.train.Optimizer classes provide sophisticated techniques to calculate parameter updates.
That means Keras layers, and the models built from them, can be used with eager execution.
As for your timing, the same page also mentions that eager execution skips the graph-building step:
TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without an extra graph-building step. Operations return concrete values instead of constructing a computational graph to run later.
This may make it harder for your model to run, given the number of Dense layers you have. Someone may correct me on that, because I have not done much work with Dense layers, or it has been a long time since I have. If that is not the problem, then I would look into your loss function. This answer may help if that becomes a problem.
Everything else looks alright though. Hope this helps.
EDIT
Ok, I see what you are saying, Fate. Yeah, the first link uses a Sequential model, but GradientTape for gradient descent. Reading deeper into the eager tutorial shows that they only use GradientTape as well. Here is what the tutorial says about training:
Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. During eager execution, use tfe.GradientTape to trace operations for computing gradients later. tfe.GradientTape is an opt-in feature to provide maximal performance when not tracing. Since different operations can occur during each call, all forward-pass operations get recorded to a "tape". To compute the gradient, play the tape backwards and then discard. A particular tfe.GradientTape can only be computed once; subsequent calls throw a runtime error.
So maybe, as of right now, only GradientTape and the estimator method are what you are supposed to use with eager.
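For completeness, here is what such a training step could look like. This is only a rough sketch, assuming the TF 1.7 eager API quoted above, one-hot labels, and the same architecture as in the question; it is not tested code from the tutorial.

import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(11,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(11, activation='softmax')
])
optimizer = tf.train.AdamOptimizer()

def train_step(x, y):
    # x, y: float32 tensors (e.g. tf.constant(train_X, dtype=tf.float32))
    # Record the forward pass on a tape so gradients can be computed later.
    with tfe.GradientTape() as tape:
        probs = model(x)
        loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y, probs))
    # Play the tape backwards to get the gradients, then apply them.
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))
    return loss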

When reading the compile method on Model (documentation), you can find an argument, run_eagerly:
run_eagerly: Bool. Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
So by default, a tf.keras.Model runs through graph execution, not eager execution.
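If you do want eager behaviour for a particular model in TF 2.x, you can opt in when compiling. A minimal sketch (the model and data are assumed to be the same as in the question):

# Opt in to eager execution of the train/test/predict logic (TF 2.x).
# Handy for debugging with ordinary Python tools, at a performance cost.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'],
              run_eagerly=True)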

Related

Does it matter if weights are loaded after or before a model is compiled?

I am using model.load_weights to load the weights of a Keras model. I am wondering whether it makes a difference if the weights are loaded before or after the model is compiled.
No, it does not matter.
Compile defines the loss function, the optimizer and the metrics. If you compile a model after loading the weights, you will lose the optimizer states, but there is absolutely no damage to the weights.
It is explained in more detail in this answer.
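A quick sketch demonstrating this (the file name and the tiny architecture are just illustrative):

import numpy as np
from tensorflow import keras

def build_model():
    return keras.models.Sequential([
        keras.layers.Dense(4, activation='relu', input_shape=(3,)),
        keras.layers.Dense(1)
    ])

# Save weights from a reference model.
ref = build_model()
ref.save_weights('ref_weights.h5')

# Order A: load, then compile.
a = build_model()
a.load_weights('ref_weights.h5')
a.compile(optimizer='adam', loss='mse')

# Order B: compile, then load.
b = build_model()
b.compile(optimizer='adam', loss='mse')
b.load_weights('ref_weights.h5')

# The weights come out identical either way.
for wa, wb in zip(a.get_weights(), b.get_weights()):
    assert np.array_equal(wa, wb)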

What effects should tensorflow.compat.v1.disable_v2_behavior() have on training using the Keras API?

I have a CNN that trains, on a few hundred thousand examples, to a validation accuracy of ~95% after one epoch. It's straightforward code, using Keras to define a network with the Sequential API. Originally I prepared and used this model on TF 1.3. When I port it over to TF 2.1, replacing the keras calls with tensorflow.keras, it gets to ~60% quickly and gets stuck there (seemingly for many epochs), and the training loss always seems to converge to the same value.
If I add in tf.disable_v2_behavior() at the top of the script, it trains similarly to before.
The documentation states simply that "It switches all global behaviors that are different between TensorFlow 1.x and 2.x to behave as intended for 1.x". Hidden behind the Keras API, I haven't found a clear answer to what this really means in practice. Why should I expect a VGG-like CNN, defined using Keras and trained with model.fit(), to work well without v2 behaviour but to fail so consistently with it?
Edit: disable_eager_execution() produces the same result, with improved performance.
Please try disabling eager execution and see if that helps.
tf.compat.v1.disable_eager_execution()
(Add this to the top of your script)

Why is L2 regularization not added back into the original loss function?

I'm aware that when using a kernel regularizer, particularly l2 loss, I should add it back into the loss function, and this is what is being done in other posts. However, in Keras, they are not following this process. Why is this so?
For instance, consider this and this notebook. They are using l2 loss as a kernel regularizer in some layers but not adding it back into the original loss. Is this because of the particular loss, or is this a behavior followed just in Keras, or am I completely misunderstanding everything?
Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: this model contains inside all the required information about the architecture and the training procedure.
When you invoke compile you specify the loss function, and when you then call fit or train_on_batch, under the hood what happens is:
Instantiate the specified loss function (e.g. categorical cross-entropy)
Fetch the regularizations applied to the model and add all of them to the loss term instantiated previously
You can see the operations that are going to be added to the loss term by accessing the .losses property of the model instance (that's a list of TensorFlow operations, usually all multiplication operations, since the regularizations are of the form regularization_strength * norm_p(variable)).
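For example, a minimal sketch (the layer sizes and the regularization strength are arbitrary):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,),
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1)
])

# One tensor per regularized layer, each roughly 0.01 * sum(W ** 2);
# Keras adds their sum to the compiled loss automatically.
print(model.losses)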
The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the same way as you would expect. It just happens behind the scenes, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.

Behaviour of Alpha Dropout in Training and Inference time

I am in the process of implementing a self-normalizing neural network using TensorFlow. There are currently TensorFlow "primitives" in the form of tf.nn.selu and tf.contrib.nn.alpha_dropout that should make this an easy process.
My problem is with tf.contrib.nn.alpha_dropout. I was expecting it to have a boolean switch for when you are in training and when you are in inference, as the usual dropout function used with other activation functions has.
In the original implementation by the authors, we indeed see that they have this boolean switch (training) in the selu dropout function (dropout_selu).
Is there something I am missing?
tf.contrib.nn.alpha_dropout should be seen as an analogue to tf.nn.dropout. The latter function also does not have an argument for a training switch. It is not to be confused with tf.layers.dropout, which wraps tf.nn.dropout and has a training argument. As we can see in the implementation, the layers version returns either the result of nn.dropout or the identity depending on the training switch. It should be relatively easy to define your own wrapper around alpha_dropout in a similar manner.
To avoid any confusion: layers.dropout eventually calls the "keras layers" version of dropout which is the implementation linked above.
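For example, a minimal wrapper sketch along the lines of tf.layers.dropout (TF 1.x; the function name, defaults, and the small usage graph are my own, not part of the API):

import tensorflow as tf

def alpha_dropout_layer(x, keep_prob, training):
    # `training` may be a Python bool or a scalar boolean tensor
    # (e.g. a placeholder); tf.cond handles the tensor case.
    return tf.cond(tf.cast(training, tf.bool),
                   lambda: tf.contrib.nn.alpha_dropout(x, keep_prob),
                   lambda: tf.identity(x))

# Usage with a feedable training flag:
inputs = tf.placeholder(tf.float32, shape=[None, 32])
dense = tf.layers.dense(inputs, 64, activation=tf.nn.selu)
is_training = tf.placeholder_with_default(False, shape=[])
h = alpha_dropout_layer(dense, keep_prob=0.95, training=is_training)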

Is it possible to train a PyTorch and a TensorFlow model together on one GPU?

I have a PyTorch model and a TensorFlow model, and I want to train them together on one GPU, following the process below: input --> pytorch model --> output_pytorch --> tensorflow model --> output_tensorflow --> pytorch model.
Is it possible to do this? If the answer is yes, is there any problem which I will encounter?
Thanks in advance.
I haven't done this, but it is possible; implementing it can be a little tricky.
You can consider each network as a function. You want to, in some sense, compose these functions to form your network. To do this, you can compute the final function by just feeding the result of one network to the other, and then use the chain rule to compute the derivatives (using the symbolic differentiation from both packages).
I think a good way to implement this might be to wrap the TF model as a PyTorch Function and use tf.gradients for computing the backward pass.
Doing gradient updates can get really hard (because some variables exist in TF's computation graph). You could turn the TF variables into PyTorch Variables by turning them into placeholders in the TF computation graph, feeding them via feed_dict and updating them with PyTorch mechanisms, but I think that would be really hard to do. Instead, if you do your updates inside the backward method of the Function, you might be able to do the job (it is really ugly, but it might work).
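For what it's worth, here is a rough sketch of the "wrap TF as a PyTorch Function" idea (TF 1.x session API; the dense layer and the shapes are stand-ins for the real TF model, and note that the TF variables themselves are not updated here):

import tensorflow as tf
import torch

sess = tf.Session()
tf_in = tf.placeholder(tf.float32, shape=[None, 8])
tf_out = tf.layers.dense(tf_in, 8)  # stand-in for the TF model
tf_grad_out = tf.placeholder(tf.float32, shape=[None, 8])
# Gradient of the TF sub-graph w.r.t. its input, seeded with the
# gradient flowing in from PyTorch.
tf_grad_in = tf.gradients(tf_out, tf_in, grad_ys=tf_grad_out)[0]
sess.run(tf.global_variables_initializer())

class TFBlock(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        y = sess.run(tf_out, feed_dict={tf_in: x.detach().numpy()})
        return torch.from_numpy(y)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        g = sess.run(tf_grad_in, feed_dict={
            tf_in: x.detach().numpy(),
            tf_grad_out: grad_output.detach().numpy()})
        return torch.from_numpy(g)

# Usage inside the PyTorch pipeline:
# output_tf = TFBlock.apply(output_pytorch)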