deep q learning is not converging - tensorflow

I'm experimenting with deep q learning using Keras , and i want to teach an agent to perform a task .
in my problem i wan't to teach an agent to avoid hitting objects in it's path by changing it's speed (accelerate or decelerate)
the agent is moving horizontally and the objects to avoid are moving vertically and i wan't him to learn to change it's speed in a way to avoid hitting them .
i based my code on this : Keras-FlappyBird
i tried 3 different models (i'm not using convolution network)
model with 10 dense hidden layer with sigmoid activation function , with 400 output node
model with 10 dense hidden layer with Leaky ReLU activation function
model with 10 dense hidden layer with ReLu activation function, with 400 output node
and i feed to the network the coordinates and speeds of all the object in my word to the network .
and trained it for 1 million frame but still can't see any result
here is my q-value plot for the 3 models ,
Model 1 : q-value
Model 2 : q-value
Model 3 : q-value
Model 3 : q-value zoomed
as you can see the q values isn't improving at all same as fro the reward ... please help me what i'm i doing wrong ..

I am a little confused by your environment. I am assuming that your problem is not flappy bird, and you are trying to port over code from flappy bird into your own environment. So even though I don't know your environment or your code, I still think there is enough to answer some potential issues to get you on the right track.
First, you mention the three models that you have tried. Of course, picking the right function approximation is very important for generalized reinforcement learning, but there are so many more hyper-parameters that could be important in solving your problem. For example, there is the gamma, learning rate, exploration and exploration decay rate, replay memory length in certain cases, batch size of training, etc. With your Q-value not changing in a state that you believe should in fact change, leads me to believe that limited exploration is being done for models one and two. In the code example, epsilon starts at .1, maybe try different values there up to 1. Also that will require messing with the decay rate of the exploration rate as well. If your q values are shooting up drastically across episodes, I would also look at the learning rate as well (although in the code sample, it looks pretty small). On the same note, gamma can be extremely important. If it is too small, you learner will be myopic.
You also mention you have 400 output nodes. Does your environment have 400 actions? Large action spaces also come with their own set of challenges. Here is a good white paper to look at if indeed you do have 400 actions https://arxiv.org/pdf/1512.07679.pdf. If you do not have 400 actions, something is wrong with your network structure. You should treat each of the output nodes as a probability of which action to select. For example, in the code example you posted, they have two actions and use relu.
Getting the parameters of deep q learning right is very difficult, especially when you account for how slow training is.

Related

Can predictions be trusted if learning curve shows validation error lower than training error?

I'm working with neural networks (NN) as a part of my thesis in geophysics, and is using TensorFlow with Keras for training my network.
My current task is to use a NN to approximate a thermodynamical model i.e a nonlinear regression problem. It takes 13 input parameters and outputs a velocity profile (velocity vs. depth) of 450 parameters. My data consists of 100,000 synthetic examples (i.e. no noise is present), split in training (80k), validation (10k) and testing (10k).
I've tested my network for a number of different architectures: wider (5-800 neurons) and deeper (up to 10 layers), different learning rates and batch sizes, and even for many epochs (5000). Basically all the standard tricks of the trade...
But, I am puzzled by the fact that the learning curve shows validation error lower than training error (for all my tests), and I've never been able to overfit to the training data. See figure below:
The error on the test set is correspondingly low, thus the network seems to be able to make decent predictions. It seems like a single hidden layer of 50 neurons is sufficient. However, I'm not sure if I can trust these results due to the behavior of the learning curve. I've considered that this might be due to the validation set consisting of examples that are "easy" to predict, but I cannot see how I should change this. A bigger validation set perhaps?
To wrap it up: Is is necessarily a bad sign if the validation error is lower than or very close to the training error? What if the predictions made with said network are decent?
Is it possible that overfitting is simply not possible for my problem and data?
In addition to trying a higher k fold and the additional testing holdout sample,perhaps mix it up when sampling from the original data set: Select a stratified sample when partitioning out the training and validation/test sets. Then partition the validation and test set without stratifying the sampling.
My opinion is that if you introduce more variation in your modeling methodology (without breaking any "statistical rules"), you can be more confident in the model that you have created.
You can achieve more trustworthy results by repeating your experiments on different data. Use cross validation with high fold (like k=10) to get better confidence of your solution performance. Usually neural networks easily overfit, if your solution has similar results on validation and test set its a good sign.
It is not that easy to tell when not knowing the exact way you have setup the experiment:
what cross-validation method did you use?
how did you split the data?
etc
As you mentioned, the fact that you observe validation error lower than training can be a result of the fact that either the training dataset contains many "hard" cases to learn or the validation set contains many "easy" cases to predict.
However, since generally speaking training loss is expected to underestimate the validation, to me the specific model appear to have unpredictable/unknown fit (perform better in predicting the unknown that the known feels indeed weird).
In order to overcome this, I would start experimenting by reconsidering the data splitting strategy, adding more data if possible, or even change your performance metric.

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

Im currenty working on a project at University, where we are using python + tensorflow and keras to train an image object detector, to detect different parts of the root system of Arabidopsis.
Our current ressults are pretty bad, as we do only have about 100 images to train the model with at this moment, but we are currently working on cultuvating more plants in order to get more images(more data) to train the tensorflow model.
We have implemented the following Mask_RCNN model:Github- Mask_RCNN tensorflow
We are looking to detect three object clases: stem, main root and secondary root.
But the model detects main roots incorrectly where the secondary roots are located.
It should be able to detect something like this:Root detection example
Training root data set that we are using right now:training images
What is the usual sample size that is used to train a neural network accurate results?
First off: I think there is no simple rule to estimate the sample size but at least it depends on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use it to reduce the "problem complexity". In some projects, in which I worked with biological data, a background removal (image - low pass filter) was the key to get better results. But you should definitely remove/crop the area outside the region of your interest (like the tape and the ruler). I would try to get the cleanest data set as possible (including manually adjustments cv2/ gimp/ etc.) to focus the network to solve "the right problem".. After that you could apply some random distortion to make it also work on fuzzy/bad/realistic images as well.
2. The way you work with your data
There are a few tricks that enables you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small gpus) and gives your network more "variety", (just think about the conv2d task: if you don't use random cropping your filters will slide over the same areas over and over again (at the same image)). Because of the same reason: apply random distortion, flip and rotate your images.
3. Network architecture
In your case I would prefer a U-Net architecture with a last conv2d output of 3 (your classes) feature maps, a final softmax activation and an categorical_crossentropy, this enables you to play with the depth, because sometimes you need sophisticated architectures to solve a problem (close to 100%) but in your case you just want to see a first working result. So fewer layers and a simple architecture could also help you to get things work. Maybe there are some trained network weights for a U-Net which meets your requirements (search on kaggle for example). Because it is also helpful (to reduce the data you need) to use "transfer learning" -> use the first layers of an network (weights) which is already trained. Using a semantic segmentation the first filters will become something like an edge detection for the most given problems/images.
4. Your mental model of "accurate results"
This is the hardest part.. because it evolves during your project. Eg. in the same moment your networks starts to perform well on preprocessed input images you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.

How to make a model of 10000 Unique items using tensorflow? Will it scale?

I have a use case where I have around 100 images each of 10000 unique items. I have 10 items with me which are all from the 10000 set and I know which 10 items too but only at the time of testing on live data. I have to now match the 10 items with their names. What would be an efficient way to recognise these items? I have full control of training environment background and the testing environment background. If I make one model of all 10000 items, will it scale? Or should I make 10000 different models and run the 10 items on the 10 models I have pretrained.
Your question is regarding something called "one-vs-all classification" you can do a google search for that, the first hit is a video lecture by Andrew Ng that's almost certainly worth watching.
The question has been long studied and in a plethora of contexts. The answer to your question does very much depend on what model you use. But I'll assume that, if you're doing image classification, you are using convolutional neural networks, because, after all, they're state of the art for most such image classification tasks.
In the context of convolutional networks, there is something called "Multi task learning" that you should read up on. Boiled down to a single sentence, the concept is that the more you ask the network to learn the better it is at the individual tasks. So, in this case, you're almost certain to perform better training 1 model on 10,000 classes than 10,000 classes each performing a one-vs-all classification scheme.
Take for example the 1,000 class Imagenet dataset, and CIFAR-10's 10 class dataset. It has been demonstrated in numerous papers that first training against Imagenet's 1,000 class dataset, and then simply replacing the last layer with a 10 class output and re-training on CIFAR-10's dataset will produce a better result than just training on CIFAR-10's dataset alone. There are admittedly multiple reasons for this result, Imagenet is a larger dataset. But the richness of class labels, multi-task learning, in the Imagenet dataset is certainly among the reasons for this result.
So that was a long winded way of saying, use one model with 10,000 classes.
An aside:
If you want to get really, really interesting, and jump into the realm of research level thinking, you might consider a 1-hot vector of 10,000 classes rather sparse and start thinking about whether you could reduce the dimensionality of your output layer using an embedding. An embedding would be a dense vector, let's say size 100 as a good starting point. Now class labels turn into clusters of points in your 100 dimensional space. I bet your network will perform even better under these conditions.
If this little aside didn't make sense, it's completely safe to ignore it, your 10,000 class output is fine. But if it did peek your interest look up information on Word2Vec, and read this really nice post on how face recognition is achieved using embeddings: https://medium.com/#ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78. You might also consider using an Auto Encoder to generate an embedding for the images (though I favor triplet embeddings as typically used in face recognition myself).

Cells detection using deep learning techniques

I have to analyse some images of drops, taken using a microscope, which may contain some cell. What would be the best thing to do in order to do it?
Every acquisition of images returns around a thousand pictures: every picture contains a drop and I have to determine whether the drop has a cell inside or not. Every acquisition dataset presents with a very different contrast and brightness, and the shape of the cells is slightly different on every setup due to micro variations on the focus of the microscope.
I have tried to create a classification model following the guide "TensorFlow for poets", defining two classes: empty drops and drops containing a cell. Unfortunately the result wasn't successful.
I have also tried to label the cells and giving to an object detection algorithm using DIGITS 5, but it does not detect anything.
I was wondering if these algorithms are designed to recognise more complex object or if I have done something wrong during the setup. Any solution or hint would be helpful!
Thank you!
This is a collage of drops from different samples: the cells are a bit different from every acquisition, due to the different setup and ambient lights
This kind of problem should definitely be possible. I would suggest starting with a cifar 10 convolutional neural network tutorial and customizing it for your problem.
In future posts you should tell us how your training is progressing. Make sure you're outputting the following information every few steps (maybe every 10-100 steps):
Loss/cost function output, you should see your loss decreasing over time.
Classification accuracy on the current batch of your training data
Classification accuracy on a held out test set (if you've implemented test set evaluation, you might implement this second)
There are many, many, many things that can go wrong, from bad learning rates, to preprocessing steps that go awry. Neural networks are very hard to debug, they are very resilient to bugs, making it hard to even know if you have a bug in your code. For that reason make sure you're visualizing everything.
Another very important step to follow is to save the images exactly as you are passing them to tensorflow. You will have them in a matrix form, you can save that matrix form as an image. Do that immediately before you pass the data to tensorflow. Make sure you are giving the network what you expect it to receive. I can't tell you how many times I and others I know have passed garbage into the network unknowingly, assume the worst and prove yourself wrong!
Your next post should look something like this:
I'm training a convolutional neural network in tensorflow
My loss function (sigmoid cross entropy) is decreasing consistently (show us a picture!)
My input images look like this (show us a picture of what you ACTUALLY FEED to the network)
My learning rate and other parameters are A, B, and C
I preprocessed the data by doing M and N
The accuracy the network achieves on training data (and/or test data) is Y
In answering those questions you're likely to solve 10 problems along the way, and we'll help you find the 11th and, with some luck, last one. :)
Good luck!

Neural network weights explode in linear unit

I am currently implementing a simple neural network and the backprop algorithm in Python with numpy. I have already tested my backprop method using central differences and the resulting gradient is equal.
However, the network fails to approximate a simple sine curve. The network hast one hidden layer (100 neurons) with tanh activation functions and a output layer with a linear activation function. Each unit hast also a bias input. The training is done by simple gradient descent with a learning rate of 0.2.
The problem arises from the gradient, which gets with every epoch larger, but I don't know why? Further, the problem is unchanged, if I decrease the learning rate.
EDIT: I have uploaded the code to pastebin: http://pastebin.com/R7tviZUJ
There are two things you can try, maybe in combination:
Use a smaller learning rate. If it is too high, you may be overshooting the minimum in the current direction by a lot, and so your weights will keep getting larger.
Use smaller initial weights. This is related to the first item. A smaller learning rate would fix this as well.
I had a similar problem (with a different library, DL4J), even in the case of extremely simple target functions. In my case, the issue turned out to be the cost function. When I changed from negative log likelihood to Poisson or L2, I started to get decent results. (And my results got MUCH better once I added exponential learning rate decay.)
Looks like you dont use regularization. If you train your network long enough it will start to learn the excact data rather than abstract pattern.
There are a couple of method to regularize your network like: stopped training, put a high cost to large gradients or more complex like e.g.g drop out. If you search web/books you probably will find many options for this.
A too big learning rate can fail to converge, and even DIVERGE, that is the point.
The gradient could diverge for this reason: when exceeding the position of the minima, the resulting point could not only be a bit further, but could even be at a greater distance than initially, but the other side. Repeat the process, and it will continue to diverge. in other words, the variation rate around the optimal position could be just to big compared to the learning rate.
Source: my understanding of the following video (watch near 7:30).
https://www.youtube.com/watch?v=Fn8qXpIcdnI&list=PLLH73N9cB21V_O2JqILVX557BST2cqJw4&index=10