I'm currently training a CNN to detect whether a person is wearing a mask. Unfortunately, I don't understand why my validation loss is so high. I noticed that the data I am validating on is sorted by class (the classes being the outputs of the net). Does that have any impact on my validation accuracy and loss?
I tested the model using computer vision and it works excellently, but the validation loss and accuracy still look very wrong.
What could be the reasons for that?
This phenomenon, at an intuitive level, can take place due to several factors:
It may be the case that you are using a very large batch size (>= 128), which can cause those fluctuations, since convergence can be negatively impacted when the batch size is too high. Several papers have studied this phenomenon. This may or may not apply to you.
It is probable that your validation set is too small. I have experienced such fluctuations when the validation set was too small (in absolute number of samples, not necessarily as a percentage of the training-validation split). In that case, a change in weights after an epoch has a more visible impact on the validation loss (and, to a great extent, on the validation accuracy).
In my opinion and experience, if you have checked that your model works well in real life, you could train for only 50 epochs: from the graph, that looks like an optimal cut-off point, as the fluctuations intensify after it and a small amount of overfitting can also be observed.
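As a concrete illustration of the second point, here is a minimal sketch (with stand-in arrays for the images and labels, so adapt the shapes to your data) of building a shuffled, stratified validation split; this also removes the class-sorted ordering of the validation data mentioned in the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the real mask / no-mask data; replace with your arrays.
X = np.random.rand(1000, 64, 64, 3).astype("float32")
y = np.random.randint(0, 2, size=1000)

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,     # a larger validation split gives a less noisy val loss
    stratify=y,        # keep the class balance instead of a class-sorted order
    shuffle=True,
    random_state=42,
)

# model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32)
```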
I read that ResNet solves the vanishing gradient problem by using skip connections. But isn't that already solved by using ReLU? Is there some other important thing I'm missing about ResNet, or does the vanishing gradient problem occur even when using ReLU?
The ReLU activation solves the problem of vanishing gradient that is due to sigmoid-like non-linearities (the gradient vanishes because of the flat regions of the sigmoid).
The other kind of "vanishing" gradient is related to the depth of the network (see this, for example). Basically, when backpropagating the gradient from layer N to layer N-k, the gradient shrinks as a function of depth (in vanilla architectures). The idea of ResNets is to help with gradient backpropagation (see, for example, Identity Mappings in Deep Residual Networks, where the authors present ResNet v2 and argue that identity skip connections are better at this).
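A minimal Keras sketch of such an identity skip connection (just the core idea, not the exact pre-activation block from the ResNet v2 paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Two conv layers whose output is added back to the input, so the
    # gradient has a direct (identity) path around them.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])   # identity skip connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))   # channel count must match `filters`
model = tf.keras.Model(inputs, residual_block(inputs))
```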
A very interesting and relatively recent paper that sheds light on how ResNets work is Residual Networks Behave Like Ensembles of Relatively Shallow Networks. The tl;dr of this paper could be (very roughly) summarized as follows: residual networks behave as an ensemble. Removing a single layer (i.e. a single residual branch, not its skip connection) doesn't really affect performance, and performance decreases smoothly as a function of the number of layers removed, which is exactly how ensembles behave. Moreover, most of the gradient during training comes from short paths; the authors show that training only these short paths doesn't affect performance in a statistically significant way compared to training all paths. This means that the effect of residual networks doesn't really come from depth, since the contribution of the long paths is almost non-existent.
The main purpose of ResNet is to enable deeper models. In theory, deeper models (such as the VGG family) should show better accuracy, but in practice they usually do not. If we add skip connections to the model, however, we can increase the number of layers and the accuracy as well.
While the ReLU activation function does solve the problem of vanishing gradients, it does not provide the deeper layers with extra information, as is the case with ResNets. The idea of propagating the original input as deep as possible through the network, thereby helping it learn much more complex features, is why the ResNet architecture was introduced and why it achieves such high accuracy on a variety of tasks.
Recently, I've trained an AI for the game Snake using deep reinforcement learning. For this I've used the Keras API with tensorflow-gpu 2.
(Not sure if you need to look at the code, but if so, these are the most important parts:
https://github.com/aeduin/snake_ai/blob/2e935fd124a54ef19f4287d19be9bd88af50f337/train.py
https://github.com/aeduin/snake_ai/blob/2e935fd124a54ef19f4287d19be9bd88af50f337/ai.py#L125
https://github.com/aeduin/snake_ai/blob/2e935fd124a54ef19f4287d19be9bd88af50f337/convnet_ai.py#L161
)
I measured that the total time spent in model.predict is 30 times greater than in model.fit, even though there are only 3 times as many predictions as actual training examples. My hypothesis is that this is because the simulation of the game runs on the CPU, and the latency of transferring data between CPU and GPU is what makes predicting so slow. (Fitting happens at the end of a simulated game, so all the training data is transferred in one go.)
A secondary question would be: how do I test this hypothesis? But the main question is: how do I solve this problem? I know how to write CUDA programs, so I could write a CUDA implementation of Snake. However, I don't know enough about TensorFlow or any other ML framework to know how to let it use this data without the data going through the CPU/RAM first. How would I do that?
One thing that helps is increasing the number of simultaneously simulated games. The 30x factor was with 512 or 256 (I don't remember which) simultaneous games, and with lower counts it gets much worse (50x with only 2 simultaneous games). But I don't expect that going beyond 512 simultaneous games adds much value.
I've also tried running the convolutional neural network on the CPU. While the ratio is more reasonable (predicting is only 4 times slower than fitting), fitting and predicting both take more time than when using the GPU.
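A minimal sketch of one way to probe this (with a stand-in network and a stand-in batch of game states) is to compare the per-call overhead of model.predict against calling the model directly on a tensor, which avoids the extra machinery predict sets up on every call:

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in network and game states; replace with the real model and inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(4),
])
batch = np.random.rand(256, 32).astype("float32")

t0 = time.perf_counter()
for _ in range(100):
    model.predict(batch, verbose=0)
t1 = time.perf_counter()
for _ in range(100):
    model(tf.constant(batch), training=False)   # direct call, no predict overhead
t2 = time.perf_counter()

print(f"model.predict: {t1 - t0:.2f}s, direct call: {t2 - t1:.2f}s")
```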
I want to know how beneficial it would be if we could reduce the number of backpropagation steps by 50%.
For example, let's say one neural network performed backpropagation 1000 times during training, and another neural network performed backpropagation only 500 times (let's assume both reached the same accuracy after training). Will the second one be significantly faster, or does it not matter much? It will, at least, increase the speed of training.
If you can train two networks to the same accuracy, but one of them only needs to process half as much data, then yes, that is a good thing.
The resulting network will not be any faster to execute during inference time, but there are still several important benefits to the training process.
Training will take half as long. This is valuable by itself, and it is extra valuable when you consider that you can now try twice as many ideas in the same amount of time. That will improve the quality of results for the entire process.
Faster convergence can reduce generalization error and overfitting: the optimization does not have as many opportunities to "fidget" and find ways to overfit.
Extremely fast convergence, called super-convergence, can improve the final training error while still keeping generalization error low, leading to better validation scores too.
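As a rough illustration of that kind of schedule (the shape and values here are assumptions, not the exact recipe from the super-convergence work), a one-cycle-style learning rate in Keras could look like this:

```python
import tensorflow as tf

EPOCHS = 50
MIN_LR, MAX_LR = 1e-4, 1e-2   # assumed bounds; tune per model

def one_cycle(epoch, lr):
    # Ramp the learning rate up for the first half of training, down for the second.
    half = EPOCHS / 2
    if epoch < half:
        return MIN_LR + (MAX_LR - MIN_LR) * epoch / half
    return MAX_LR - (MAX_LR - MIN_LR) * (epoch - half) / half

lr_callback = tf.keras.callbacks.LearningRateScheduler(one_cycle)
# model.fit(x_train, y_train, epochs=EPOCHS, callbacks=[lr_callback])
```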
Speaking more generally, there is a lot of research and other activity on the topic of how to make networks train as quickly and cheaply as possible. One such benchmark is DAWNBench, which sets a target accuracy to achieve and then ranks approaches based on how fast they reach that target, and how much the GPUs or other infrastructure cost to do it.
This idea of cost reduction is also one of the drivers behind transfer learning.
I'm working on a TensorFlow project in which I have a neural network in a reinforcement learning system, used to predict the Q-values. I have 50 inputs and 10 outputs. Some of the inputs are in the range 30-70, and the rest are between 0 and 1, so I normalize only the first group, using this formula:
x_new = (x - x_min)/(x_max - x_min)
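In code, that per-feature min-max scaling applied only to the first group looks roughly like this minimal NumPy sketch (the column indices for the 30-70 features are placeholders):

```python
import numpy as np

# Stand-in for the 50 inputs: first 25 columns in the 30-70 range,
# last 25 already in [0, 1] (indices are hypothetical).
X = np.random.uniform(30, 70, size=(100, 50))
X[:, 25:] = np.random.uniform(0, 1, size=(100, 25))

cols = slice(0, 25)                    # columns to normalize
x_min = X[:, cols].min(axis=0)
x_max = X[:, cols].max(axis=0)
X[:, cols] = (X[:, cols] - x_min) / (x_max - x_min)
```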
Although I know the mathematical base of neural networks, I do not have experience applying them in real cases, so I do not really know if the hyperparameters I am using are correctly chosen. The ones I have currently are:
2 hidden layers with 10 and 20 neurons each
Learning rate of 0.5
Batch size of 10 (I have tried different values up to 256, obtaining the same result)
The problem I'm not able to solve is that the weights of this neural network only change in the first two or three iterations, and stay fixed afterwards.
What I had read in other posts is that the algorithm may be finding a local optimum, and that normalizing the inputs is a good way to address it. However, after normalizing the inputs, I am still in the same state. So, my question is whether anyone knows where the problem may be, and whether there is any other technique (like normalization) that I should add to my pipeline.
I haven't added any code to the question because I think my problem is rather conceptual. However, if more details are needed, I can add them.
Some pointers you can check:
50 input data points with 10 classes? If that is the case, the data is far too small for the network to learn anything at all.
Which activation function are you using? Try ReLU instead of sigmoid or tanh:
activation functions
How deep is your network? Maybe your gradients are either vanishing or exploding:
vanishing or exploding gradients
Check whether your network can overfit the training data. If it cannot, it is not learning anything; see the sketch below.
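A minimal sketch of that last check, with stand-in data matching the 50-input / 10-output setup from the question:

```python
import numpy as np
import tensorflow as tf

# Try to overfit a tiny batch: if the loss does not approach zero, the
# network is effectively not learning (dead units, bad learning rate, or a bug).
tiny_x = np.random.rand(16, 50).astype("float32")   # stand-in for the 50 inputs
tiny_y = np.random.rand(16, 10).astype("float32")   # stand-in for the 10 Q-values

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(50,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
history = model.fit(tiny_x, tiny_y, epochs=500, verbose=0)
print("final loss:", history.history["loss"][-1])   # should be close to 0
```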
I want to design a CNN for a binary image classification task: detecting whether a small object is present or absent in the images. The images are greyscale (unsigned short) with size 512x512 (already downsampled from 2048x2048), and I have thousands of them for training and testing.
It's my first time using a CNN for this kind of task, and I hope to achieve ~80% accuracy to start, so I'd like to know, in general, how to design the CNN so that I have the best chance of achieving my goal.
My specific questions are:
How many convolution layers and fully-connected layers should I use?
How many feature maps are in each convolution layer and how many nodes in each fully-connected layer?
What's the filter size in each convolution layer?
I'm trying to implement the CNN using Keras with the TensorFlow backend, and my computer's specs are: 8 Intel Xeon CPUs @ 3.5 GHz; 32 GB memory; 2 Nvidia GPUs: a GeForce GTX 980 and a Quadro K4200.
With that hardware and software, I'd also like to know the computational time of training. Specifically:
How long will it take to train the CNN (with the above structure) for one epoch on 1000 of the images mentioned above, and (in general) how many epochs are needed to achieve ~80% accuracy?
The reason I want to know the typical computational time is to make sure I set up everything properly.
Hope I didn't ask too many questions in my first post.
You'd probably do very well by taking one of the existing models that Keras makes available for this kind of task, such as VGG16, VGG19, InceptionV3 and others: https://keras.io/applications/.
You can experiment with them: try different parameters, little tweaks here and there, and so on. Since you've got only one object class to detect, you can probably try smaller versions of them.
All the code can be found at https://github.com/fchollet/keras/tree/master/keras/applications
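For example, a hedged sketch of putting a small binary head on a pretrained VGG16 (the input size and head layers are assumptions; the 512x512 grayscale images would need to be resized and repeated to 3 channels to use the ImageNet weights):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                          # freeze the pretrained features

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)  # object present / absent

model = Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```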
Speed is very relative. It's impossible to predict, because each installation method, driver, version and operating system may or may not use your hardware capabilities properly or fully.
But with your specifications, it should be pretty fast, if everything is set up well.