Deep neural network diverges after convergence - tensorflow

I implemented the A3C network in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, the policy network starts giving a single action a very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?

Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5
models from 50 experiments.
That can mean that in a lot of cases the algorithm does not work that well. In my experience, A3C often diverges, even after convergence. Careful learning-rate scheduling can help. Or do what the authors did: train several agents with different seeds and pick the one that performs best on your validation data. You could also employ early stopping when the validation error starts to increase.
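The early-stopping and best-of-several-seeds ideas above can be sketched roughly as follows. This is only an illustration: `early_stopping_index`, `best_seed`, the `patience` parameter and the reward numbers are made up for the example and are not part of any A3C implementation.

```python
def early_stopping_index(scores, patience=3):
    """Return the index of the best score seen before training is stopped.

    Training stops once `patience` consecutive evaluations fail to
    improve on the best score so far.
    """
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break
    return best_idx


def best_seed(results):
    """Pick the seed whose run reached the highest validation score."""
    return max(results, key=lambda seed: max(results[seed]))


# Hypothetical validation rewards: the agent converges near 7, then diverges.
curve = [1.0, 3.0, 6.5, 7.0, 6.0, 4.0, 2.0, 1.0]
stop_at = early_stopping_index(curve, patience=3)

# Hypothetical validation curves for three seeds.
runs = {0: [1.0, 5.0], 1: [2.0, 7.0], 2: [3.0, 4.0]}
chosen = best_seed(runs)
```

In practice you would save a checkpoint at every evaluation and restore the one at `stop_at`.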


Reproducibility, Controlling Randomness, Operator-level Randomness in TFF

I have a TFF code that takes a slightly different optimization path while training across different runs, despite having set all the operator-level seeds, numpy seeds for sampling clients in each round, etc. The FAQ section on TFF website does talk about randomness and expectation in TFF, but I found the answer slightly confusing. Is it the case that some aspects of the randomness can't be directly controlled even after setting all the operator-level seeds that one could; because one can't control the way sub-sessions are started and ended?
To be more specific, these are all the operator-level seeds that my code already sets: dataset.shuffle, create_tf_dataset_from_all_clients, keras.initializers and np.random.seed for per-round client sampling (which uses numpy). I have verified that the initial model state is the same across runs, but as soon as training starts, the model states start diverging across different runs. The divergence is gradual/slow in most cases, but not always.
The code is quite complex, so not adding it here.
There is one more source of non-determinism that would be very hard to control: summation of float32 numbers is not associative, so the result depends on the order in which the terms are added.
When you simulate a number of clients in a round, the TFF executor does not have a way to control the order in which the model updates are added together. As a result, there could be some differences at the bottom of the float32 range. While this may sound negligible, it can add up over a number of rounds (I have seen hundreds, but could be also less), and eventually cause different loss/accuracy/model weights trajectories, as the gradients will start to be computed at slightly different points.
BTW, this tutorial has more info on best practices in controlling randomness in TFF.
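To see the order dependence of float32 summation in isolation, here is a small sketch that does not use TFF at all. It emulates float32 rounding with the standard `struct` module; the helper names `f32` and `sum_f32` and the three "updates" are assumptions chosen to make the effect obvious.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


# The same three numbers, added in two different orders.
order_a = sum_f32([1e8, 1.0, -1e8])   # (1e8 + 1) - 1e8: the 1 is rounded away
order_b = sum_f32([1e8, -1e8, 1.0])   # (1e8 - 1e8) + 1: the 1 survives
```

Real model updates differ by far less than this, but the mechanism is the same, and the small differences compound over rounds.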

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

I am trying to tune a basic neural network as practice. (Based on an example from a coursera course: Neural Networks and Deep Learning - DeepLearning.AI)
I face the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1.: set the random seed to a fixed value
2.: run my experiments more times without setting the seed
Both versions have pros and cons.
My biggest concern is that if I use a random seed (e.g. tf.random.set_seed(1)), then the chosen values can be "over-fitted" to the seed and may not work well without the seed or if its value is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments several times without a random seed, then I can inspect fewer options (due to limited computing capacity) and still only inspect a subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for handling this?
Does TensorFlow have built-in tools for this purpose? I appreciate any pointers to descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (generally in machine learning) is a common issue. Setting the random seed to a fixed number ensures reproducibility and fair comparison. Repeating the same experiment will lead to the same outcomes. As you probably know, best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure other circumstances that might influence the performance of your model (e.g. train-test-split or weight initialization) are the same for each hyperparameter in order to have a fair comparison of the performance. Therefore I would always recommend to fix the seed.
Now, the problem with this is, as you already pointed out, the performance for each model will still depend on the random seed, like the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold-cross validation. That means you repeat the k-fold cross-validation multiple times, each time with a different seed, select best parameters of that run, test on test data and average the final results to get a good estimate of performance + variance and therefore eliminate the influence the seed has in the validation process.
Alternatively you can perform k-fold cross validation a single time and train each split n-times with a different random seed (eliminating the effect of weight initialization, but still having the effect of the train-test-split).
Finally, TensorFlow has no built-in tool for this purpose. You as the practitioner have to take care of this.
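A minimal sketch of the repeated k-fold idea, in plain Python so that the structure is visible. The function names are made up, and `dummy_evaluate` is a placeholder that a real experiment would replace with code that builds, trains and scores a model using the given seed. The final evaluation on held-out test data is left out.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed):
    """Shuffle the sample indices with a seeded RNG and split them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_k_fold(n_samples, k, seeds, evaluate):
    """Run k-fold CV once per seed; return mean and std of the per-repeat scores."""
    repeat_scores = []
    for seed in seeds:
        folds = k_fold_indices(n_samples, k, seed)
        fold_scores = []
        for i, val_idx in enumerate(folds):
            # Every fold except fold i is used for training.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            fold_scores.append(evaluate(train_idx, val_idx, seed))
        repeat_scores.append(statistics.mean(fold_scores))
    return statistics.mean(repeat_scores), statistics.stdev(repeat_scores)


def dummy_evaluate(train_idx, val_idx, seed):
    # Placeholder score: a real version would train and validate a model here.
    return len(train_idx) / (len(train_idx) + len(val_idx))


mean_score, std_score = repeated_k_fold(
    20, k=5, seeds=[0, 1, 2], evaluate=dummy_evaluate
)
```

You would run this once per hyperparameter setting and compare the mean scores, using the standard deviation to judge whether a difference is larger than the seed-to-seed noise.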
There is no absolute right or wrong answer to your question. You have almost answered your own question already. In what follows, however, I will try to expand on it via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the
initial parameters need to “break symmetry” between different units.
If two hidden units with the same activation function are connected to
the same inputs, then these units must have different initial
parameters. If they have the same initial parameters, then a
deterministic learning algorithm applied to a deterministic cost and
model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for choosing those values, such as Xavier initialization, which samples from a normal distribution with a mean of 0 and a variance that depends on the number of input and output units of the layer. This is a very interesting article to read.
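As a rough illustration of Xavier (Glorot) normal initialization, the sketch below draws weights with standard deviation sqrt(2 / (fan_in + fan_out)). The function name and the layer sizes are made up for the example; in Keras you would normally use the built-in Glorot initializers instead of writing this yourself.

```python
import math
import random


def xavier_normal(fan_in, fan_out, seed=None):
    """Sample a fan_in x fan_out weight matrix from N(0, 2 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


# Hypothetical layer with 256 inputs and 128 outputs.
weights = xavier_normal(fan_in=256, fan_out=128, seed=0)
```

Because every weight is drawn independently, no two hidden units start with identical parameters, which is the symmetry breaking described in the quote.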
Having said that, the initial values are important but not extremely critical if proper rules are followed, as mentioned above. They are important because large or otherwise improper values may lead to vanishing or exploding gradients. On the other hand, different "proper" weights should not hugely change the final results, unless they cause the aforementioned problems or get the neural network stuck in some local minimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition to the above, using a good optimizer rather than standard stochastic gradient descent should, in theory, keep the initial values from noticeably affecting the quality of the final results. A good example is Adam, which adapts the learning rate per parameter.
If you still get noticeably different results with different "proper" initial weights, there are some things that might help make the neural network more stable, for example: use a train-test split, use GridSearchCV to find the best parameters, use k-fold cross-validation, etc.
In the end, the best scenario is obviously to train the same network many times with different random initial weights and then take the average and variance of the results, for a more reliable judgement of the overall performance. How many times? Well, if you can do it hundreds of times, so much the better, yet that is clearly almost impractical (unless you have some Googlish hardware capability and capacity). As a result, we come to the same conclusion that you reached in your question: there is a tradeoff between time and space complexity on one side and the reliability of using a seed on the other, taking into consideration some of the rules of thumb mentioned in the previous points. Personally, I am okay with using the seed because I believe that "It’s not who has the best algorithm that wins. It’s who has the most data" (Banko and Brill, 2001). Hence, using a seed with enough data samples (how many is enough is subjective, but the more the better) should not cause any concern.

How to interpret "Value Loss" chart in TensorBoard?

I have a target-finding, obstacle-avoiding helicopter in Unity Machine Learning Agents. Looking at the TensorBoard for my training, I'm trying to get a feel for how to interpret the "Losses/Value Loss".
I've googled many articles on ML Loss, like this one, but I can't seem to get an intuitive understanding yet of what it all means for my little helicopter and possible changes I should implement, if any. (The helicopter is rewarded by getting closer and again for reaching the target, and punished by getting further or colliding. It measures a variety of things like relative speed, relative target position, ray sensors and so on, and it does basically work in target-finding, whereas more complicated maze type obstacles have not been tested or trained on yet. It's using 3 layers.) Thanks!
In reinforcement learning and specifically regarding actor/critic algorithms, value loss is the difference (or an average of many such differences) between the learning algorithm's expectation of a state's value and the empirically observed value of that state.
What is a state's value? A state's value is, in short, how much reward you can expect given that you start in that state. Immediate reward contributes fully to this amount. Reward that may occur later contributes less, with more distant rewards contributing less and less. We call this reduction in contribution to value a "discount", or we say that these rewards are "discounted".
Expected value is how much the critic part of the algorithm predicts the value to be. In the case of a critic implemented as a neural network, it's the output of the neural network with the state as its input.
Empirically observed value is the amount you get when you add up the rewards that you actually got when you left that state, plus any rewards (discounted by some amount) you got immediately after that for some number of steps (we'll say after these steps you ended up on state X), and (perhaps, depending on implementation) plus some discounted amount based on the value of state X.
In short, the smaller it is, the better it got at predicting how well it is going to perform. This doesn't mean that it gets better at playing - after all, one can be terrible at a game yet be accurate at predicting that they will lose and when they will lose if they learn to choose actions that will make them lose quickly!
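The definitions above can be made concrete with a small sketch. It assumes a mean-squared-error value loss and an n-step return with an optional bootstrap from the critic's estimate of the final state X. The function names and numbers are made up, and the exact loss used by ML-Agents may differ.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Empirically observed value: discounted rewards plus a discounted bootstrap.

    `bootstrap_value` is the critic's estimate for the state reached after
    the last reward (state X); use 0.0 if the episode ended there.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def value_loss(predicted, targets):
    """Mean squared difference between critic predictions and observed values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


# Three steps of reward 1 with discount 0.5, then a state the critic values at 4.
target = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
# The critic predicted 2.0 for the starting state.
loss = value_loss([2.0], [target])
```

Here the observed value is 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25, so the squared error against the prediction of 2.0 is 0.0625.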

what is a "convolution warmup"?

I have encountered this phrase a few times before, mostly in the context of neural networks and TensorFlow, but I get the impression it's something more general and not restricted to these environments.
Here, for example, they say that this "convolution warmup" process takes about 10k iterations.
Why do convolutions need to warm up? What prevents them from reaching their top speed right away?
One thing that I can think of is memory allocation. If so, I would expect it to be solved after 1 (or at least <10) iterations. Why 10k?
Edit for clarification: I understand that the warmup is a time period or number of iterations that has to pass until the convolution operator reaches its top speed (time per operator).
What I am asking is: why is it needed, and what happens during this time that makes the convolution faster?
Training neural networks works by offering training data, calculating the output error, and backpropagating the error back to the individual connections. For symmetry breaking, the training doesn't start with all zeros, but by random connection strengths.
It turns out that with the random initialization, the first training iterations aren't really effective. The network isn't anywhere near to the desired behavior, so the errors calculated are large. Backpropagating these large errors would lead to overshoot.
A warmup phase is intended to get the initial network away from a random network, and towards a first approximation of the desired network. Once the approximation has been achieved, the learning rate can be accelerated.
This is an empirical result. The number of iterations will depend on the complexity of your problem domain, and therefore also on the complexity of the necessary network. Convolutional neural networks are fairly complex, so warmup is more important for them.
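One common way to implement such a warmup phase is to ramp the learning rate up linearly over the first iterations. The sketch below assumes that scheme; `warmup_lr`, `base_lr` and `warmup_steps` are illustrative names, not TensorFlow API.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from near 0 up to base_lr.

    `step` is zero-based. After `warmup_steps` steps the full base_lr is used.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Learning rates for the first 12 steps of a hypothetical 10-step warmup.
schedule = [warmup_lr(step, base_lr=0.1, warmup_steps=10) for step in range(12)]
```

The large early errors are then backpropagated with a small step size, which limits the overshoot described above.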
You are not alone in noticing that the time per iteration varies.
I ran the same example and had the same question. I can say that the main reason is the different input image shapes and the different number of objects to detect.
Here are my test results for discussion.
I first enabled tracing and looked at the timeline, and found that the number of Conv2D occurrences varies between steps in the GPU stream all compute row. Then I used export TF_CUDNN_USE_AUTOTUNE=0 to disable autotune.
After that, the timeline shows the same number of Conv2D ops in each step, and the times are about 0.4s.
The time costs still differ, but they are much closer!
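If you prefer to disable autotune from inside the Python script instead of the shell, a minimal sketch is to set the same environment variable before TensorFlow is imported. The import is left commented out so that the snippet runs without TensorFlow installed.

```python
import os

# Same effect as `export TF_CUDNN_USE_AUTOTUNE=0` in the shell.
# Set it before importing TensorFlow so the setting is already in place
# when the library initializes.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

# import tensorflow as tf  # import only after the variable has been set
```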

Is it feasible to train an A3C algorithm in an episodic context?

The A3C algorithm (and N-step Q-learning) updates the globally shared network once every N timesteps. N is usually pretty small, 5 or 20 as far as I remember.
Wouldn't it be possible to set N to infinity, meaning that the networks are only trained at the end of an episode? I do not argue that it is necessarily better (though, to me, it sounds like it could be), but at least it should not be a lot worse, right?
The lack of asynchronous training, which relies on multiple agents exploring different environments asynchronously and thereby stabilizes training without replay memory, might be a problem if the training is done sequentially (as in: for each worker thread, train the network on the whole observed SAR sequence). Though the training could still be done asynchronously with sub-sequences, it would only make training with stateful LSTMs a little bit more complicated.
The reason why I am asking is the "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" paper. To compare it to algorithms like A3C, it would make more sense - from a code engineering point of view - to train both algorithms in the same episodic way.
Definitely, just set N to be larger than the maximum episode length (or modify the source to remove the batching condition). Note that in the original A3C paper, this is done with the dynamic control environments (with continuous action spaces) with good results. It is commonly argued that being able to update mid-episode (which is possible but not required) is a key advantage of TD methods: it exploits the Markov property of the MDP.
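To illustrate why a large N is equivalent to episodic training, here is a small sketch of how an episode is cut into update segments. `update_segments` is a made-up helper, not part of any A3C implementation.

```python
def update_segments(episode_length, n):
    """Return the (start, end) timestep ranges that each trigger one update.

    An update happens every n steps, and once more at the end of the
    episode if steps are left over.
    """
    return [
        (start, min(start + n, episode_length))
        for start in range(0, episode_length, n)
    ]


# A hypothetical 12-step episode.
small_n = update_segments(12, n=5)        # three updates, two of them mid-episode
episodic = update_segments(12, n=10**9)   # one update covering the whole episode
```

With N larger than the episode length there is a single segment, so the network is only trained once the episode has finished.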