Policy Gradient (REINFORCE) for invalid actions

Currently I'm trying to implement the REINFORCE policy gradient method (with a neural network) for a game. Now obviously, there are certain actions that are invalid in certain states (can't fire the rocket launcher if you don't have one!).
I tried to mask the softmax outputs (the action probabilities) so that it only samples from valid actions. This works fine (or so it seems); however, after several iterations of training, the valid actions are no longer being chosen (all outputs for these nodes turn into 0 for certain input combinations). Interestingly, a certain action node (an invalid action) seems to give 1 (100% probability) in these cases.
This is causing a huge problem, since I then have to resort to randomly choosing the action to perform, which obviously doesn't do well. Are there any other ways to deal with the problem?
P.S. I'm updating the network by setting the "label" so that the chosen action node has the value of the discounted reward while the remaining actions are 0, then using categorical_crossentropy in Keras.
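For reference, the update described in the P.S. looks roughly like the sketch below (a minimal, illustrative version; the layer sizes, names and the reinforce_update helper are made up here, not taken from the original code):

```python
import numpy as np
from tensorflow import keras

n_states, n_actions = 8, 4  # placeholder sizes for illustration

# A small policy network with a softmax output, as in the question.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(n_states,)),
    keras.layers.Dense(n_actions, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

def reinforce_update(states, actions, returns):
    """One REINFORCE-style update: the "label" is a one-hot vector for the
    chosen action scaled by the discounted return, so the categorical
    cross-entropy reduces to -G * log(pi(a|s))."""
    targets = np.zeros((len(actions), n_actions))
    targets[np.arange(len(actions)), actions] = returns
    model.train_on_batch(np.asarray(states, dtype=np.float32), targets)
```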

I ended up using 2 different approaches, but they both follow the methodology of applying invalid-action masks (a sketch of both is below).
One is to apply the mask after obtaining the softmax values from the policy network, then renormalize the probabilities of the remaining actions and sample from those.
The second method is to apply the mask to the logits (before the softmax), which is simpler and seems to work better (although I didn't do any quantitative measurement to prove this).
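A minimal sketch of both variants, assuming valid_mask is a float tensor of 0s and 1s with the same shape as the logits/probabilities (the function names are illustrative):

```python
import tensorflow as tf

def mask_after_softmax(probs, valid_mask):
    # Variant 1: zero out invalid actions after the softmax and
    # renormalize what is left before sampling.
    masked = probs * valid_mask
    return masked / tf.reduce_sum(masked, axis=-1, keepdims=True)

def mask_before_softmax(logits, valid_mask):
    # Variant 2: push the logits of invalid actions to a very large
    # negative value so the softmax assigns them (numerically) zero probability.
    return tf.nn.softmax(logits + (1.0 - valid_mask) * -1e9, axis=-1)
```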

Related

Why is gradient clipping not supported with a distribution strategy in Tensorflow?

It looks like gradient clipping is not supported when using a distribution strategy:
https://github.com/tensorflow/tensorflow/blob/f9f6b4cec2a1bdc5781e4896d80cee1336a2fbab/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L383
("Gradient clipping in the optimizer (by setting clipnorm or clipvalue) is currently unsupported when using a distribution strategy.")
Any reason for this? I am tempted to define a custom def _minimize(strategy, tape, optimizer, loss, trainable_variables): that clips the gradients directly.
GitHub user tomerk wrote:
There are two possible places to clip when you have distribution strategies enabled:
- before gradients get aggregated (usually wrong)
- after gradients get aggregated (usually right & what people expect)
We want it working with the second case (clipping after gradients are aggregated). The issue is that the optimizers are written with clipping happening in the code before aggregation does.
We looked into changing this, but it would have required either:
- API changes that break existing users of optimizer apply_gradients/other non-minimize methods
- changing the signatures of methods optimizer implementers need to implement, breaking existing custom optimizers
So rather than:
- quietly doing clipping in the wrong place
- increasing churn & breaking existing users or existing custom optimizers just for this individual feature
we instead decided to leave this disabled for now. We'll roll support for this into a larger optimizer refactoring that solves a larger set of issues.
This has now been implemented.
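For anyone stuck on an older TF version, here is a rough sketch of the kind of manual workaround the question hints at: a custom training step (meant to be run via strategy.run) that aggregates the per-replica gradients itself, clips the aggregated result, and tells the optimizer not to aggregate again. This is an assumption-laden sketch, not the library's built-in mechanism; whether SUM or MEAN is the right reduction depends on how the per-replica loss is scaled.

```python
import tensorflow as tf

@tf.function
def train_step(model, optimizer, loss_fn, x, y, clip_norm=1.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # Per-replica gradients (assumes none of them are None).
    grads = tape.gradient(loss, model.trainable_variables)
    # Aggregate across replicas ourselves, then clip the aggregated gradients.
    replica_ctx = tf.distribute.get_replica_context()
    grads = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm)
    # Tell the optimizer the gradients are already aggregated.
    optimizer.apply_gradients(
        zip(grads, model.trainable_variables),
        experimental_aggregate_gradients=False,
    )
    return loss
```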

Using LSTMs to predict from single-element sequence

I'm trying to apply reinforcement learning to a round-based game environment. Each round I get a (self-contained / nearly Markovian) state and have to provide an action to progress in the world. Because there exist some long-term strategies (develop resource "A", wait a few rounds for development, use resource "A"), I'm thinking of using an LSTM layer in my neural net. During training I can feed sequences of rounds into the network to train the LSTM; however, during the testing phase I'm only able to provide the current state (this is a hard requirement).
I'm wondering whether LSTMs are a viable option here or if they are not suitable for this usage, because I can only provide one state during testing / deployment.
Yes, LSTMs are a viable option here. In Keras this amounts to setting the argument called "stateful" to True. This makes the layer not reset the internal state of the cells between samples, meaning it keeps remembering the previous step(s) until the cell is explicitly reset.
In this case, you would simply set the LSTM to stateful, hand it one sample per step, and reset the state after the episode is done. Remember that you might not want to keep it stateful during training if there is enough signal that you can fit all the timesteps you need for finding the long-term strategies into one sample, as you'd probably be doing replays over multiple episodes.
If you're using anything other than Keras, googling for "stateful LSTM" in your framework of choice ought to help you further.
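A minimal sketch of that setup in Keras (the sizes, the greedy action choice and the helper names are illustrative assumptions, not part of the answer above):

```python
import numpy as np
from tensorflow import keras

state_dim, n_actions = 10, 4  # placeholder sizes

model = keras.Sequential([
    # batch_input_shape = (batch_size, timesteps, features): one sample and
    # one timestep per call, since the agent only sees the current state.
    keras.layers.LSTM(64, stateful=True, batch_input_shape=(1, 1, state_dim)),
    keras.layers.Dense(n_actions, activation="softmax"),
])

def act(state):
    """Feed a single state; the LSTM keeps its internal state between calls."""
    probs = model.predict(np.asarray(state).reshape(1, 1, state_dim), verbose=0)[0]
    return int(np.argmax(probs))  # greedy choice just for illustration

def on_episode_end():
    model.reset_states()  # forget the episode before the next one starts
```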

Can predictions be trusted if learning curve shows validation error lower than training error?

I'm working with neural networks (NNs) as part of my thesis in geophysics, and am using TensorFlow with Keras to train my network.
My current task is to use a NN to approximate a thermodynamical model, i.e. a nonlinear regression problem. It takes 13 input parameters and outputs a velocity profile (velocity vs. depth) of 450 parameters. My data consists of 100,000 synthetic examples (i.e. no noise is present), split into training (80k), validation (10k) and testing (10k) sets.
I've tested my network with a number of different architectures: wider (5-800 neurons) and deeper (up to 10 layers), different learning rates and batch sizes, and even training for many epochs (5000). Basically all the standard tricks of the trade...
But I am puzzled by the fact that the learning curve shows the validation error lower than the training error (for all my tests), and I've never been able to overfit to the training data. See figure below:
The error on the test set is correspondingly low, thus the network seems to be able to make decent predictions. It seems like a single hidden layer of 50 neurons is sufficient. However, I'm not sure if I can trust these results due to the behavior of the learning curve. I've considered that this might be due to the validation set consisting of examples that are "easy" to predict, but I cannot see how I should change this. A bigger validation set perhaps?
To wrap it up: Is it necessarily a bad sign if the validation error is lower than or very close to the training error? What if the predictions made with said network are decent?
Is it possible that overfitting is simply not possible for my problem and data?
In addition to trying a higher k-fold and the additional testing holdout sample, perhaps mix it up when sampling from the original data set: select a stratified sample when partitioning out the training and validation/test sets, then partition the validation and test sets without stratifying the sampling.
My opinion is that if you introduce more variation in your modeling methodology (without breaking any "statistical rules"), you can be more confident in the model that you have created.
You can achieve more trustworthy results by repeating your experiments on different data. Use cross-validation with a high number of folds (like k=10) to get better confidence in your solution's performance. Neural networks usually overfit easily; if your solution has similar results on the validation and test sets, that's a good sign.
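A minimal sketch of the k=10 cross-validation suggested above; build_model() is a placeholder for whatever compiled architecture you are already using (the helper names are made up):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, k=10, epochs=100):
    """Train k models on k folds and report mean/std of the held-out loss.
    Assumes build_model() returns a freshly compiled Keras model with a
    single loss and no extra metrics."""
    fold_scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        fold_scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
    return float(np.mean(fold_scores)), float(np.std(fold_scores))
```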
It is not that easy to tell without knowing the exact way you have set up the experiment:
what cross-validation method did you use?
how did you split the data?
etc
As you mentioned, the fact that you observe a validation error lower than the training error can be a result of either the training dataset containing many "hard" cases to learn or the validation set containing many "easy" cases to predict.
However, since generally speaking the training loss is expected to be lower than the validation loss, to me this specific model appears to have an unpredictable/unknown fit (performing better on the unknown than on the known does indeed feel weird).
In order to overcome this, I would start experimenting by reconsidering the data splitting strategy, adding more data if possible, or even changing your performance metric.

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

I'm currently working on a project at university, where we are using Python + TensorFlow and Keras to train an image object detector to detect different parts of the root system of Arabidopsis.
Our current results are pretty bad, as we only have about 100 images to train the model with at the moment, but we are currently working on cultivating more plants in order to get more images (more data) to train the TensorFlow model.
We have implemented the following Mask_RCNN model: Github - Mask_RCNN tensorflow
We are looking to detect three object classes: stem, main root and secondary root.
But the model detects main roots incorrectly where the secondary roots are located.
It should be able to detect something like this: Root detection example
Training root data set that we are using right now: training images
What is the usual sample size that is used to train a neural network for accurate results?
First off: I think there is no simple rule to estimate the sample size, but at least it depends on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use them, to reduce the "problem complexity". In some projects in which I worked with biological data, background removal (image minus a low-pass filtered version) was the key to getting better results. You should also definitely remove/crop the area outside the region of interest (like the tape and the ruler). I would try to get the data set as clean as possible (including manual adjustments with cv2/GIMP/etc.) to focus the network on solving "the right problem". After that you could apply some random distortion to make it work on fuzzy/bad/realistic images as well. A rough sketch of the preprocessing idea is below.
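A hedged sketch of that background-removal step with OpenCV (the file names, crop box and blur size are placeholders that will need tuning for your scans):

```python
import cv2

img = cv2.imread("root_scan.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
img = img[100:1900, 200:1400]                              # crop away ruler/tape (placeholder box)
background = cv2.GaussianBlur(img, (51, 51), 0)            # low-pass estimate of the background
# Difference from the low-pass background keeps the high-frequency root structure;
# use plain cv2.subtract instead if the roots are always lighter than the background.
foreground = cv2.absdiff(img, background)
foreground = cv2.normalize(foreground, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("root_scan_clean.png", foreground)
```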
2. The way you work with your data
There are a few tricks that enable you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small GPUs) and gives your network more "variety" (just think about the conv2d task: if you don't use random cropping, your filters will slide over the same areas over and over again on the same image). For the same reason: apply random distortions, flip and rotate your images. A sketch of such a generator is below.
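A minimal sketch of a patch-cropping/augmentation generator along those lines (shapes and parameter values are placeholders; it assumes the images and masks are numpy arrays larger than the patch size):

```python
import numpy as np

def random_patch_generator(images, masks, patch=256, batch_size=8):
    """Yield batches of randomly cropped, flipped and rotated patches."""
    while True:
        xb, yb = [], []
        for _ in range(batch_size):
            i = np.random.randint(len(images))
            img, msk = images[i], masks[i]
            # random crop of a patch x patch window
            top = np.random.randint(img.shape[0] - patch)
            left = np.random.randint(img.shape[1] - patch)
            img = img[top:top + patch, left:left + patch]
            msk = msk[top:top + patch, left:left + patch]
            # random horizontal flip and 90-degree rotation
            if np.random.rand() < 0.5:
                img, msk = np.fliplr(img), np.fliplr(msk)
            k = np.random.randint(4)
            img, msk = np.rot90(img, k), np.rot90(msk, k)
            xb.append(img)
            yb.append(msk)
        yield np.stack(xb), np.stack(yb)
```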
3. Network architecture
In your case I would prefer a U-Net architecture with a final conv2d output of 3 feature maps (your classes), a softmax activation and a categorical_crossentropy loss. This lets you play with the depth: sometimes you need a sophisticated architecture to solve a problem (close to 100%), but in your case you just want to see a first working result, so fewer layers and a simple architecture could also help you get things working. Maybe there are pretrained network weights for a U-Net that meet your requirements (search on Kaggle, for example), because it is also helpful (to reduce the data you need) to use "transfer learning" -> reuse the first layers of a network whose weights are already trained. With semantic segmentation, the first filters become something like an edge detector for most problems/images. A tiny sketch of such a model is below.
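A tiny, hedged sketch of a U-Net-style model with that output head (layer sizes are arbitrary placeholders, and a real U-Net would have more levels and skip connections):

```python
from tensorflow import keras
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 3), n_classes=3):
    inp = keras.Input(shape=input_shape)
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    u1 = layers.UpSampling2D()(c2)
    u1 = layers.Concatenate()([u1, c1])   # the U-Net-style skip connection
    c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
    # 3 feature maps (one per class), per-pixel softmax, categorical cross-entropy
    out = layers.Conv2D(n_classes, 1, activation="softmax")(c3)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```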
4. Your mental model of "accurate results"
This is the hardest part, because it evolves during your project. E.g. the moment your network starts to perform well on preprocessed input images, you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem, but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.

Deep Q-learning is not converging

I'm experimenting with deep Q-learning using Keras, and I want to teach an agent to perform a task.
In my problem I want to teach an agent to avoid hitting objects in its path by changing its speed (accelerate or decelerate).
The agent is moving horizontally and the objects to avoid are moving vertically, and I want it to learn to change its speed in a way that avoids hitting them.
I based my code on this: Keras-FlappyBird
I tried 3 different models (I'm not using a convolutional network):
a model with 10 dense hidden layers with sigmoid activation, with 400 output nodes
a model with 10 dense hidden layers with Leaky ReLU activation
a model with 10 dense hidden layers with ReLU activation, with 400 output nodes
I feed the coordinates and speeds of all the objects in my world to the network,
and trained it for 1 million frames but still can't see any result.
Here are my Q-value plots for the 3 models:
Model 1: q-value
Model 2: q-value
Model 3: q-value
Model 3: q-value zoomed
As you can see, the Q-values aren't improving at all, and the same goes for the reward... Please help me figure out what I'm doing wrong.
I am a little confused by your environment. I am assuming that your problem is not Flappy Bird, and that you are trying to port the Flappy Bird code over to your own environment. So even though I don't know your environment or your code, I still think there is enough here to point out some potential issues and get you on the right track.
First, you mention the three models that you have tried. Of course, picking the right function approximator is very important for generalized reinforcement learning, but there are many more hyper-parameters that could be important in solving your problem. For example, there are gamma, the learning rate, the exploration rate and its decay, the replay memory length in certain cases, the batch size of training, etc. Since your Q-values are not changing in states where you believe they should change, my guess is that too little exploration is being done for models one and two. In the code example, epsilon starts at 0.1; maybe try different values there, up to 1. That will also require adjusting the decay rate of the exploration rate. If your Q-values are shooting up drastically across episodes, I would also look at the learning rate (although in the code sample, it looks pretty small). On the same note, gamma can be extremely important: if it is too small, your learner will be myopic.
You also mention you have 400 output nodes. Does your environment have 400 actions? Large action spaces come with their own set of challenges; here is a good white paper to look at if you do indeed have 400 actions: https://arxiv.org/pdf/1512.07679.pdf. If you do not have 400 actions, something is wrong with your network structure. Each output node should correspond to one action and represent its estimated Q-value, and you then select the action with the highest value (or epsilon-greedily). For example, in the code example you posted, they have two actions and use relu.
Getting the parameters of deep Q-learning right is very difficult, especially when you account for how slow training is.
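For concreteness, a hedged sketch of the exploration schedule and epsilon-greedy action selection discussed above (all the numbers are placeholders to tune, and model is assumed to be a Keras network whose outputs are the per-action Q-values):

```python
import numpy as np

epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999  # start high, decay slowly
gamma = 0.99  # too small a gamma makes the learner myopic

def select_action(model, state, n_actions):
    """Epsilon-greedy selection over the network's Q-value outputs."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)              # explore
    q_values = model.predict(state[np.newaxis], verbose=0)[0]
    return int(np.argmax(q_values))                      # exploit

def decay_epsilon():
    global epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # anneal exploration each step/episode
```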