I am in the process of implementing a self-normalizing neural network using TensorFlow. There are currently TensorFlow "primitives" in the form of tf.nn.selu and tf.contrib.nn.alpha_dropout that should make this an easy process.
My problem is with tf.contrib.nn.alpha_dropout. I was expecting it to have a boolean switch for when you are in training and when you are in inference as does the usual dropout function used with other activation functions.
In the original implementation by the authors, we indeed see that they have this boolean switch (training) in the selu dropout function (dropout_selu).
Is there something I am missing?
tf.contrib.nn.alpha_dropout should be seen as an analogue to tf.nn.dropout. The latter function also does not have an argument for a training switch. It is not to be confused with tf.layers.dropout, which wraps tf.nn.dropout and has a training argument. As we can see in the implementation, the layers version returns either the result of nn.dropout or the identity depending on the training switch. It should be relatively easy to define your own wrapper around alpha_dropout in a similar manner.
To avoid any confusion: layers.dropout eventually calls the "keras layers" version of dropout which is the implementation linked above.
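A minimal sketch of the wrapper idea, written in plain Python on a list of floats so it runs without TensorFlow. The function name, the `rng` parameter and the list-based input are assumptions made for illustration; the dropout arithmetic follows the alpha dropout formulas from the SELU paper as I understand them, and is not the code behind tf.contrib.nn.alpha_dropout. The point is only the training switch: identity at inference, dropout during training.

```python
import random

# SELU constants (alpha and scale) from the self-normalizing networks paper.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def alpha_dropout(values, keep_prob, training, rng=None):
    """Alpha dropout with a training switch (hypothetical helper)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    # Inference: behave like the identity, as the layers-style wrapper does.
    if not training or keep_prob == 1.0:
        return list(values)
    rng = rng or random.Random()
    alpha_p = -ALPHA * SCALE  # value assigned to dropped units
    # Affine correction that keeps mean and variance roughly unchanged.
    a = (keep_prob + alpha_p ** 2 * keep_prob * (1.0 - keep_prob)) ** -0.5
    b = -a * alpha_p * (1.0 - keep_prob)
    out = []
    for v in values:
        kept = rng.random() < keep_prob
        out.append(a * (v if kept else alpha_p) + b)
    return out


activations = [0.5, -1.0, 2.0]
print(alpha_dropout(activations, keep_prob=0.9, training=False))
print(alpha_dropout(activations, keep_prob=0.9, training=True, rng=random.Random(0)))
```

In TensorFlow the same shape would presumably be a conditional that returns either the result of tf.contrib.nn.alpha_dropout or the input unchanged.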
This is a more general version of a question I've already asked: Significant difference between outputs of deep tensorflow keras model in Python and tensorflowjs conversion
As far as I can tell, the layers of a tfjs model when run in the browser (so far only tested in Chrome and Firefox) will have small numerical differences in the output values when compared to the same model run in Python or Node. The cumulative effect of these small differences across all the layers of the model can cause fairly significant differences in the output. See here for an example of this.
This means a model trained in Python or Node will not perform as well in terms of accuracy when run in the browser. And the deeper your model, the worse it will get.
Therefore my question is, what is the best way to train a model to use with tfjs in the browser? Is there a way to ensure the output will be identical? Or do you just have to accept that there will be small numerical differences and, if so, are there any methods that can be used to train a model to be more resilient to this?
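To get a feel for how per-layer rounding can add up, here is a toy sketch in plain Python. It assumes the browser differences behave roughly like reduced-precision rounding, which is my assumption and not something established in the question. It runs the same small random network twice, once in full double precision and once with every activation rounded to 16-bit floats, and reports the largest difference after each layer. All names (`to_half`, `make_layers`, `layer_drift`) are made up for the example, and tanh is used only to keep the toy values bounded.

```python
import math
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def make_layers(depth, width, seed=0):
    """Random square weight matrices with entries in [-1, 1]."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(width)]
             for _ in range(width)]
            for _ in range(depth)]


def forward(x, layers, round_fn=None):
    """Return the activations of every layer, optionally rounding each one."""
    outputs = []
    h = list(x)
    for weights in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        if round_fn is not None:
            h = [round_fn(v) for v in h]
        outputs.append(h)
    return outputs


def layer_drift(x, layers):
    """Largest absolute difference between exact and rounded runs, per layer."""
    exact = forward(x, layers)
    rounded = forward(x, layers, to_half)
    return [max(abs(a - b) for a, b in zip(e, r))
            for e, r in zip(exact, rounded)]


for i, d in enumerate(layer_drift([0.1, 0.2, 0.3, 0.4], make_layers(10, 4)), 1):
    print(f"layer {i}: max abs difference = {d:.6f}")
```

Running the real model on the same input in both environments and comparing intermediate layer outputs in this way would show where the divergence actually starts.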
This answer is based on my personal observations. As such, it is debatable and not backed by much evidence. Some things that I follow to get accuracy of 16-bit models close to 32 bit models are:
Avoid using activations that have small upper and lower bounds, such as sigmoid or tanh, for hidden layers. These activations cause the weights of the next layer to become very sensitive to small values, and hence, small changes. I prefer using ReLU for such models. Since it is now the standard activation for hidden layers in most models, you should be using it in any case.
Avoid weight decay and L1/L2 regularization on weights while training (the kernel_regularizer parameter in Keras), since these increase the sensitivity of the weights. Use Dropout instead; I didn't observe a major drop in performance on TFLite when using it in place of the numerical regularizers.
I'm aware that when using a kernel regularizer, particularly L2 loss, I should be adding it back into the loss function, and this is what is being done in other posts. However, in Keras, they are not following this process. Why is this so?
For instance, consider this and this notebook. They are using L2 loss as a kernel regularizer in some layers but not adding it back into the original loss. Is this because of the particular loss, is this a behavior followed only in Keras, or am I completely misunderstanding everything?
Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: this model contains inside all the required information about the architecture and the training procedure.
When you invoke the method compile, fit or train_on_batch, you specify the loss function, but under the hood what happens is:
Instantiate the loss function specified (e.g. categorical cross entropy)
Fetch from the model the regularizations applied and add all of them to the loss term previously instantiated
You can see the operations that are going to be added to the loss term by accessing the .losses property of the model instance (that's a list of TensorFlow operations, usually all multiplication operations, since the regularizations are in the form regularization_strength * norm_p(variable)).
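As a rough illustration of the arithmetic in the two steps above, here is a plain-Python sketch. It is not Keras code; the function names and the toy numbers are made up, and it assumes the L2 penalty has the form strength * sum(w ** 2) for each regularized layer.

```python
def l2_penalty(weights, strength):
    """L2 penalty for one layer: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def total_loss(data_loss, layer_weights, strength):
    """Add one penalty per regularized layer to the data loss.

    Returns the total and the list of per-layer penalties, which plays the
    role of the .losses list mentioned above.
    """
    penalties = [l2_penalty(w, strength) for w in layer_weights]
    return data_loss + sum(penalties), penalties


# Toy numbers: a cross-entropy value of 0.5 and two regularized layers.
total, penalties = total_loss(0.5, [[1.0, 2.0], [3.0]], strength=0.01)
print(penalties)
print(total)
```

The value the optimizer minimizes is the total, which is why nothing has to be added by hand in the notebooks.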
The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the same way as you would expect. It just happens behind the scenes, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.
I am training a Keras model (tf.keras.models.Sequential) calling its method fit().
Since I enabled eager execution, training time (for the same number of epochs) went up from 20.1s to 49.4s. Also, training didn't seem to converge anymore, as loss remained around 9 (without eager execution it went down to 1), while method fit() didn't even report the requested metric "accuracy" anymore.
Is eager execution supported for Keras models? Note that I am calling method fit() on the model, not using an estimator.
Here is the snippet of code that declares the model and does the training. Using TF 1.7 for GPU installed with pip3.
import tensorflow as tf

# Eager execution has to be enabled before any other TensorFlow call.
tf.enable_eager_execution()

model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(11,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(11, activation='softmax')
])
optimizer = tf.train.AdamOptimizer()
# optimizer = 'adam'
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
# train_X and train_y are defined elsewhere.
model.fit(x=train_X, y=train_y, epochs=200, batch_size=64, verbose=2)
UPDATE: filed issue #18642 on the TensorFlow GitHub.
The issue I reported on tensorflow got this answer:
Thank you for the bug report. We have a fix for this issue, that will
show up on GitHub soon.
See issue #18642 on the TensorFlow GitHub.
Based on this, I understand that method fit() of Keras models will be supported with eager execution, once the bug is fixed.
Here is a quote from the Tensorflow site found here
There are many parameters to optimize when calculating derivatives. TensorFlow code is easier to read when structured into reusable classes and objects instead of a single top-level function. Eager execution encourages the use of the Keras-style layer classes in the tf.keras.layers module. Additionally, the tf.train.Optimizer classes provide sophisticated techniques to calculate parameter updates.
That means Keras layers, and the models built from them, can be used with eager execution.
As for your timing, the link also mentions how using eager stops building of graphs.
TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without an extra graph-building step. Operations return concrete values instead of constructing a computational graph to run later.
This may make it harder for your model to run given the number of Dense layers you have. Someone may correct me on that because I have not done much work with Dense layers before, or it has been a long time since I have. If that does not work then I would look into your loss function. This answer may help if that becomes a problem.
Everything else looks alright though. Hope this helps.
EDIT
Ok, I see what you are saying, Fate. Yeah, the first link uses a Sequential model, but with GradientTape for gradient descent. Reading deeper into the eager tutorial shows that they only use GradientTape as well. Here is what the tutorial says about training:
Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. During eager execution, use tfe.GradientTape to trace operations for computing gradients later. tfe.GradientTape is an opt-in feature to provide maximal performance when not tracing. Since different operations can occur during each call, all forward-pass operations get recorded to a "tape". To compute the gradient, play the tape backwards and then discard. A particular tfe.GradientTape can only be computed once, subsequent calls throw a runtime error.
So maybe as of right now only Gradient tape and the estimator method are what you are supposed to use with eager.
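To make the "record on a tape, then play it backwards" description in the quote concrete, here is a toy reverse-mode tape in plain Python. It is only a sketch of the idea and not how tfe.GradientTape is implemented; the `Tape` and `Var` classes are invented for this example and support only addition and multiplication of scalars.

```python
class Tape:
    """Records each operation's inputs and local derivatives, in order."""

    def __init__(self):
        self.ops = []      # entries: (output, [(input, local_gradient), ...])
        self.used = False

    def gradient(self, target):
        """Play the tape backwards once, filling .grad on every variable."""
        if self.used:
            raise RuntimeError("this tape has already been played back")
        self.used = True
        target.grad = 1.0
        for out, parents in reversed(self.ops):
            for parent, local_grad in parents:
                parent.grad += local_grad * out.grad  # chain rule
        self.ops.clear()   # discard the recording


class Var:
    """Scalar value whose operations are recorded on a tape."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape
        self.grad = 0.0

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.ops.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.ops.append((out, [(self, other.value), (other, self.value)]))
        return out


tape = Tape()
w, x, b = Var(2.0, tape), Var(3.0, tape), Var(1.0, tape)
y = w * x + b          # forward pass is recorded as it runs
loss = y * y
tape.gradient(loss)    # backward pass over the recording
print(loss.value, w.grad, b.grad)
```

A second call to `tape.gradient` raises a RuntimeError here, mirroring the single-use behaviour the tutorial describes.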
When reading the compile method on Model (documentation), you can find an argument, run_eagerly:
run_eagerly: Bool. Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
So by default, a tf.keras.Model runs through graph execution, not eager execution.
I've been working on a prototype and I am having issues with backpropagation. I am currently using the latest keras and tensorflow build (with tensorflow as a backend; I have looked into cntk, mxnet, and chainer; so far only chainer would allow me to do it, but the training time is quite slow).
My current layer is similar to a convolutional layer with more operations than a simple multiplication.
I know that tensorflow should use automatic differentiation if all the operations support it to calculate the gradient and perform gradient descent.
Currently my layer uses the following operators: reduce_sum, sum, subtraction, multiplication and division.
It also relies on the following methods: extract_image_patches, reshape, transpose.
I doubt any of these would cause an issue with automatic differentiation. I built 2 layers as tests: one inherits from the base layer in keras while the other inherits directly from _Conv. In both cases, whenever I use that layer anywhere in a model, no weights are updated during the training process.
How could I solve this problem and fix backpropagation?
Edit:
(Here is the layer implementation: https://github.com/roya0045/cvar2/blob/master/tfvar.py,
for the testing itself see https://github.com/roya0045/cvar2/blob/master/test2.py)
I have a pytorch model and a tensorflow model, and I want to train them together on one GPU, following the process below: input --> pytorch model --> output_pytorch --> tensorflow model --> output_tensorflow --> pytorch model.
Is it possible to do this? If the answer is yes, are there any problems I will encounter?
Thanks in advance.
I haven't done this, but it is possible, although implementing it can be a bit tricky.
You can consider each network as a function. You want to, in some sense, compose these functions to form your network. To do this, you can compute the final function by just giving the result of one network to the other, and then use the chain rule to compute the derivatives (using symbolic differentiation from both packages).
I think a good way to implement this might be to wrap the TF model as a PyTorch Function and use tf.gradients to compute the backward pass.
Doing gradient updates can get really hard, because some variables exist in TF's computation graph. You can turn the TF variables into PyTorch Variables, turn them into placeholders in the TF computation graph, feed them in through feed_dict and update them using PyTorch mechanisms, but I think that would be really hard to do. Instead, if you do your updates inside the backward method of the function, you might be able to get the job done (it is really ugly, but it might work).
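To show what "compose the functions and use the chain rule" means mechanically, here is a framework-free sketch in plain Python. The two toy classes only stand in for the PyTorch and TF models; their names, their scalar parameters and the `forward`/`backward` interface are assumptions made for this example. Real code would go through a PyTorch Function and tf.gradients as described above. For simplicity the first and last stages are separate instances rather than one shared model.

```python
class ScaleModel:
    """Stand-in for a PyTorch model: f(x) = w * x."""

    def __init__(self, w):
        self.w = w
        self.grad_w = 0.0

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return self.w * x

    def backward(self, grad_out):
        self.grad_w = grad_out * self.x
        return grad_out * self.w        # gradient with respect to the input


class SquareShiftModel:
    """Stand-in for a TF model: g(u) = (u + b) ** 2."""

    def __init__(self, b):
        self.b = b
        self.grad_b = 0.0

    def forward(self, u):
        self.u = u
        return (u + self.b) ** 2

    def backward(self, grad_out):
        local = 2.0 * (self.u + self.b)
        self.grad_b = grad_out * local
        return grad_out * local


def run_pipeline(stages, x):
    """Forward through every stage, then push the gradient back in reverse."""
    out = x
    for stage in stages:
        out = stage.forward(out)
    grad = 1.0                          # d(out)/d(out)
    for stage in reversed(stages):
        grad = stage.backward(grad)     # chain rule across framework borders
    return out, grad


first, middle, last = ScaleModel(3.0), SquareShiftModel(1.0), ScaleModel(0.5)
output, grad_input = run_pipeline([first, middle, last], 2.0)
print(output, grad_input, first.grad_w, middle.grad_b, last.grad_w)
```

Each stage only needs the gradient of the final output with respect to its own output, which is exactly the quantity that has to be handed from one framework to the other.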