In a CNN, should we have more layers with fewer filters or fewer layers with a large number of filters? - optimization

I am trying to understand the effect of adding more layers versus the effect of increasing the number of filters in the existing layers of a CNN. Consider a CNN with 2 hidden layers and 64 filters each, and a second CNN with 4 hidden layers and 32 filters each. The kernel sizes are the same in both CNNs, and the number of parameters is also the same in both cases.
Which one should I expect to perform better? I am thinking in terms of hyperparameter tuning, batch sizes, training time, etc.

In CNNs, the deeper layers correspond to higher-level features in images. In the first layer, you are going to get low-level features such as edges and lines. The network uses these layers to create more complex features: using edges to find circles, using circles to find wheels, using wheels to find a car (I skipped a couple of steps, but you get the gist).
So to answer your question, you need to consider how complex your problem is. If you are working on something like ImageNet, you can expect the model with more layers to have the edge. On the other hand, for problems like MNIST, you don't need high-level features.
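For intuition, here is a minimal sketch in Keras of the two variants from the question; the input shape, kernel size and 10-class head are assumptions for illustration, and count_params() lets you check the question's premise about equal parameter counts.

```python
# Minimal sketch (Keras) of the two CNNs from the question.
# The 28x28x1 input shape and 10-class head are assumptions.
from tensorflow.keras import layers, models

def build_cnn(num_layers, num_filters, input_shape=(28, 28, 1)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(num_layers):
        model.add(layers.Conv2D(num_filters, (3, 3), padding="same",
                                activation="relu"))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(10, activation="softmax"))
    return model

shallow_wide = build_cnn(num_layers=2, num_filters=64)
deep_narrow = build_cnn(num_layers=4, num_filters=32)

# Print the parameter count of each variant to verify the comparison.
print(shallow_wide.count_params(), deep_narrow.count_params())
```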

Related

Is it possible to train a CNN on a dataset and test it on another dataset with different classes?

I am new to deep learning, and I am doing research using CNNs. I need to train a CNN model on a dataset of images (landmark images) and test the same model on a different dataset (landmark images too). One of the motivations is to see the ability of the model to generalize. But the problem is: since the datasets used for training and testing are not the same, the classes are not the same! Possibly the number of classes differs too, which means that the predictions made on the test dataset are not trustworthy (since the weights of the output layer have been calculated based on the different classes belonging to the train dataset). Is there any way to evaluate a model on a different dataset without affecting test accuracy?
The performance of a neural network on one dataset will not generally be the same as its performance on another. Images in one dataset can be more difficult to distinguish than those in another. As a rule of thumb: if your landmark datasets are similar, it's likely that performance will be similar. However, this is not always the case: subtle differences between the datasets can result in significantly different performance.
You can account for the potentially different performance on the two datasets by training another network on the other dataset. This will give you a baseline of what to expect when you try to generalize your network to it.
You can apply your neural network trained for one set of classes to another set of classes. There are two main approaches to this:
Transfer learning. This is where the last layer of your trained network is replaced with one or more new layers that are trained, by themselves, to classify the new images. (Use for many classes. Can use for few classes.)
All-Transfer learning. Rather than replacing the last layer, add a new layer after it and train only the final layers. (Use for few classes.)
Both approaches are much quicker than training a neural network from scratch. A minimal sketch of the first approach follows.
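This is a sketch rather than the answer's exact recipe: it assumes a Keras VGG16 base pretrained on ImageNet and a hypothetical num_new_classes for the second landmark dataset.

```python
# Transfer-learning sketch: freeze a pretrained base, replace the head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_new_classes = 20  # placeholder: set to the class count of the new dataset

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_new_classes, activation="softmax"),  # new classification head
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_dataset_images, new_dataset_labels, epochs=5)
```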
I assume that you are facing a classification problem.
What exactly do you mean? Do you have classes A, B and C in your train dataset and the same classes in your test dataset with a different labeling, or do you have completely different classes in your test dataset with respect to your train dataset?
You can solve the first problem by creating a mapping from train label to test label or vice versa.
The second case depends on what you are trying to achieve: if you want the model to predict classes that were never trained, you won't get any meaningful outcome.
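For the first case, a minimal sketch of such a mapping; the label codes are hypothetical.

```python
# The model predicts in the train dataset's label space, so map its
# outputs to the test dataset's codes (values here are hypothetical).
train_to_test = {0: 2, 1: 0, 2: 1}  # classes A/B/C coded differently per dataset

predictions = [0, 2, 2, 1]                        # model output (train coding)
mapped = [train_to_test[p] for p in predictions]  # now comparable to test labels
```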

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

I'm currently working on a project at university, where we are using Python + TensorFlow and Keras to train an image object detector to detect different parts of the root system of Arabidopsis.
Our current results are pretty bad, as we only have about 100 images to train the model with at the moment, but we are currently cultivating more plants in order to get more images (more data) to train the TensorFlow model.
We have implemented the following Mask_RCNN model: GitHub - Mask_RCNN tensorflow
We are looking to detect three object classes: stem, main root and secondary root.
But the model detects main roots incorrectly where the secondary roots are located.
It should be able to detect something like this: Root detection example
Training root data set that we are using right now: training images
What is the usual sample size needed to train a neural network to accurate results?
First off: I think there is no simple rule to estimate the sample size, but it depends at least on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use them, to reduce the problem complexity. In some projects in which I worked with biological data, background removal (image - low-pass filter) was the key to getting better results. But you should definitely remove/crop the area outside your region of interest (like the tape and the ruler). I would try to get the cleanest data set possible (including manual adjustments with cv2/GIMP/etc.) to focus the network on solving "the right problem". After that, you could apply some random distortion so it also works on fuzzy/bad/realistic images.
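As a rough illustration of the "image - low pass filter" idea, here is a sketch using OpenCV; the blur kernel size, file name and crop coordinates are placeholders to adapt to your scans.

```python
# Background removal sketch: subtract a heavily blurred (low-pass) version
# of the image so only high-frequency structures (the roots) remain.
import cv2

img = cv2.imread("root_scan.png", cv2.IMREAD_GRAYSCALE)  # hypothetical filename

# A large Gaussian blur approximates the slowly varying background.
background = cv2.GaussianBlur(img, (51, 51), 0)

# Subtract it; cv2.subtract saturates at 0 instead of wrapping around.
foreground = cv2.subtract(img, background)

# Crop away the ruler/tape region before training
# (coordinates are placeholders for your region of interest).
roi = foreground[100:900, 150:1050]
cv2.imwrite("root_scan_clean.png", roi)
```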
2. The way you work with your data
There are a few tricks that enable you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small GPUs) and gives your network more "variety". (Just think about the conv2d task: if you don't use random cropping, your filters will slide over the same areas of the same image over and over again.) For the same reason: apply random distortion, and flip and rotate your images.
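A sketch of such a generator, assuming images and masks are lists of NumPy arrays larger than the patch size; the patch and batch sizes are arbitrary starting points.

```python
# Generator that "expands" the dataset with random patches and flips.
import numpy as np

def random_patch_generator(images, masks, patch=128, batch_size=8):
    """Yield batches of randomly cropped, randomly flipped patches."""
    while True:
        batch_x, batch_y = [], []
        for _ in range(batch_size):
            i = np.random.randint(len(images))
            img, mask = images[i], masks[i]
            y = np.random.randint(img.shape[0] - patch)
            x = np.random.randint(img.shape[1] - patch)
            px = img[y:y + patch, x:x + patch]
            py = mask[y:y + patch, x:x + patch]
            if np.random.rand() < 0.5:              # random horizontal flip
                px, py = px[:, ::-1], py[:, ::-1]
            batch_x.append(px)
            batch_y.append(py)
        yield np.stack(batch_x), np.stack(batch_y)
```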
3. Network architecture
In your case I would prefer a U-Net architecture with a final conv2d output of 3 feature maps (your classes), a final softmax activation and a categorical_crossentropy loss. This lets you play with the depth: sometimes you need sophisticated architectures to solve a problem (close to 100%), but in your case you just want to see a first working result, so fewer layers and a simple architecture could also help you get things working. Maybe there are trained network weights for a U-Net that meet your requirements (search on Kaggle, for example), because it is also helpful (to reduce the data you need) to use transfer learning: reuse the first layers (weights) of a network that is already trained. In semantic segmentation, the first filters become something like an edge detector for most given problems/images.
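To make this concrete, here is a minimal U-Net sketch in Keras with the 3-feature-map softmax output described above; the depth, filter counts and input size are assumptions, and real U-Nets are usually deeper.

```python
# Minimal U-Net sketch: encoder, bottleneck, decoder with skip connections.
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(256, 256, 1))  # assumed input size

# Encoder
c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D()(c1)
c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
p2 = layers.MaxPooling2D()(c2)

# Bottleneck
b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)

# Decoder with skip connections
u2 = layers.UpSampling2D()(b)
c3 = layers.Conv2D(32, 3, padding="same",
                   activation="relu")(layers.concatenate([u2, c2]))
u1 = layers.UpSampling2D()(c3)
c4 = layers.Conv2D(16, 3, padding="same",
                   activation="relu")(layers.concatenate([u1, c1]))

# One feature map per class (stem, main root, secondary root)
outputs = layers.Conv2D(3, 1, activation="softmax")(c4)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```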
4. Your mental model of "accurate results"
This is the hardest part, because it evolves during your project. E.g., the moment your network starts to perform well on preprocessed input images, you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem, but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.

How to arrange different layers in CNN

I searched many articles about convolutional neural networks and found that there are some good structures I can refer to, for example AlexNet, VGG and GoogLeNet.
However, if I want to customize a CNN architecture myself, how should I arrange/order the different layers, e.g. convolution layer, dropout, max pooling? Is there any standard, or should I just keep trying different combinations to produce a good result?
In my opinion there isn't a standard per se, but there are common combinations:
1. If you want to create a deeper network, you can use residual blocks to avoid running into the vanishing gradient problem.
2. The standard of using 3x3 convolutions exists because it reduces computational cost: for example, three stacked 3x3 convolutions achieve the receptive field of a 7x7 convolution at a smaller cost.
3. The main reason for dropout is to introduce regularization, which can also be achieved by batch normalization, as its authors claim.
4. Before deciding what to enhance and how to enhance it, one must understand the problem one is trying to solve.
You can go through the case study that was taught at Stanford:
Stanford case study
The video can help you understand many of these combinations, how they result in model improvements, and how to build your own network.
You generally want to put a pooling layer after a convolutional layer. Also, you can think of dropout as a parameter that is applied to a layer, and not a separate layer altogether -- whichever is easier for you to envision.
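A sketch of this conventional ordering in Keras; the filter counts, input shape and dropout rate are illustrative.

```python
# Conventional arrangement: conv -> pool blocks, then dense layers with
# dropout applied to the dense layer it follows.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),          # pooling follows convolution
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                  # dropout "applied to" the dense layer
    layers.Dense(10, activation="softmax"),
])
```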

Image Classification: Heavily unbalanced data over thousands of classes

I have a dataset consisting of around 5000 categories of images, but the number of images in each category varies from 20 to 2000, which is quite unbalanced. Also, the number of images is far from enough to train a model from scratch. I decided to fine-tune pretrained models, like the Inception models.
But I am not sure about how to deal with unbalanced data. There are several possible approaches:
Oversampling: oversample the minority categories. But even with aggressive image augmentation techniques, we may not be able to avoid overfitting.
Also, how do you generate balanced batches from an unbalanced dataset over so many categories? Do you have any ideas about such a pipeline mechanism with TensorFlow?
SMOTE: I think it is not so effective for high-dimensional signals like images.
Put weights on the cross-entropy loss in every batch. This might be useful within a single batch, but cannot deal with the overall imbalance.
Any ideas about this? Any feedback will be appreciated.
Use tf.losses.softmax_cross_entropy and set weights for each class inversely proportional to their training frequency to "balance" the optimization.
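The answer names the TF1-style tf.losses.softmax_cross_entropy API; here is the same inverse-frequency idea in sketch form, passed to Keras training via its class_weight argument instead (the class counts below are hypothetical).

```python
# Compute per-class weights inversely proportional to training frequency.
import numpy as np

class_counts = np.array([2000, 150, 20])  # hypothetical images per class

# Normalize so the average weight is roughly 1.
weights = class_counts.sum() / (len(class_counts) * class_counts)
class_weight = {i: float(w) for i, w in enumerate(weights)}

# With Keras this plugs straight into training:
# model.fit(x_train, y_train, class_weight=class_weight)
```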
Start with the pre-trained ImageNet layers and add your own final layers (with appropriate convolution, dropout and flatten layers as required). Freeze all but the last few of the ImageNet layers, then train on your dataset.
For unbalanced data (and in general small datasets), use data augmentation to create more training images. Keras has this functionality built-in: Building powerful image classification models using very little data
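A sketch of that built-in augmentation; the transform ranges are assumptions to tune for your images.

```python
# Keras built-in augmentation: generate transformed variants on the fly.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
)

# Flow augmented batches during training, e.g. with a Keras model:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)
```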

How to make a model of 10000 Unique items using tensorflow? Will it scale?

I have a use case where I have around 100 images each of 10000 unique items. I have 10 items with me, all from the 10000 set, and I know which 10 items, but only at the time of testing on live data. I now have to match the 10 items with their names. What would be an efficient way to recognise these items? I have full control of the training environment background and the testing environment background. If I make one model of all 10000 items, will it scale? Or should I make 10000 different models and run the 10 items on the 10 models I have pretrained?
Your question is about something called "one-vs-all classification"; you can do a Google search for that, and the first hit is a video lecture by Andrew Ng that is almost certainly worth watching.
The question has long been studied, in a plethora of contexts. The answer does very much depend on what model you use. But I'll assume that, if you're doing image classification, you are using convolutional neural networks, because, after all, they're the state of the art for most such image classification tasks.
In the context of convolutional networks, there is something called "multi-task learning" that you should read up on. Boiled down to a single sentence, the concept is that the more you ask the network to learn, the better it becomes at the individual tasks. So, in this case, you're almost certain to perform better training 1 model on 10,000 classes than 10,000 models each performing a one-vs-all classification scheme.
Take, for example, the 1,000-class ImageNet dataset and CIFAR-10's 10-class dataset. It has been demonstrated in numerous papers that first training on ImageNet's 1,000 classes and then simply replacing the last layer with a 10-class output and re-training on CIFAR-10 produces a better result than training on CIFAR-10 alone. There are admittedly multiple reasons for this result (ImageNet is a larger dataset), but the richness of class labels in ImageNet, i.e. multi-task learning, is certainly among them.
So that was a long-winded way of saying: use one model with 10,000 classes.
An aside:
If you want to get really, really interesting, and jump into the realm of research-level thinking, you might consider a one-hot vector of 10,000 classes rather sparse and start thinking about whether you could reduce the dimensionality of your output layer using an embedding. An embedding would be a dense vector, say of size 100 as a good starting point. Now class labels turn into clusters of points in your 100-dimensional space. I bet your network will perform even better under these conditions.
If this little aside didn't make sense, it's completely safe to ignore it; your 10,000-class output is fine. But if it did pique your interest, look up Word2Vec, and read this really nice post on how face recognition is achieved using embeddings: https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78. You might also consider using an autoencoder to generate an embedding for the images (though I personally favor triplet embeddings, as typically used in face recognition).
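To make the aside concrete, here is a hedged sketch of such an embedding head; the MobileNetV2 base, input size and L2 normalization are assumptions, and the 100-dimensional size follows the suggestion above.

```python
# Embedding head sketch: replace the 10,000-way softmax with a dense,
# L2-normalized 100-d embedding trained with a metric loss.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.Dense(100),  # 100-d embedding, as suggested above
    layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
])

# Train with a metric loss (e.g. triplet loss) so items of the same class
# cluster together; at test time, match the 10 items by nearest neighbour
# against reference embeddings of the known classes.
```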