Misclassification using VGG16 Net - tensorflow

I use Faster RCNN to classify 33 items, but most of them are confused with one another. All items are snack packets and sweet packets like in the link below,
so their colors and shapes are similar.
What could be the best way to solve this misclassification problem?

Fine-tuning is a way to reuse features learned on a large dataset in our own problem. Instead of training the complete network again, we freeze the weights of the lower layers of the network and add a few layers at the end, as required. We then train on our own dataset. The advantage is that we only need to train a few parameters rather than all the millions in the network. Another is that we don't need a large dataset to fine-tune.
You can find more here. This is another useful resource, where the author explains this in more detail (with code).
Note: This is also known as transfer learning.
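The freeze-and-extend idea described above could look roughly like the following in Keras. This is only a sketch under assumptions: it builds a plain image classifier rather than a Faster RCNN detector, and the function name build_finetune_model, the 256-unit hidden layer and the pooling choice are illustrative, not taken from the question.

```python
import tensorflow as tf


def build_finetune_model(num_classes, input_shape=(224, 224, 3), weights="imagenet"):
    """Frozen VGG16 convolutional base plus a small trainable classifier head."""
    # include_top=False drops VGG16's original fully connected layers.
    base = tf.keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=input_shape
    )
    base.trainable = False  # freeze the pretrained lower layers

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    # New layers added at the end; only these are trained.
    # The head size (256) is an arbitrary choice for illustration.
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_finetune_model(33) would give a model for the 33 items; passing weights=None skips the ImageNet download, but then there are no pretrained features to reuse, so that is only useful for checking that the code runs.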


Change the spatial input dimension during training

I am training a yolov4 (fully convolutional) in tensorflow 2.3.0.
I would like to change the spatial input shape of the network during training, to further adjust the weights to different scales.
Is this possible?
I know darknet exists, but it lacks some very specific augmentations that I use and have implemented in my repo, which is why I ask explicitly about tensorflow.
To be more precise about what I want to do:
I want to train for several batches at Y1xX1xC, then change the input size to Y2xX2xC and train again for several batches, and so on.
It is not possible. In the past people trained several networks for different scales but the current state-of-the-art approach is feature pyramids.
Another good candidate is dilated convolution, which can learn long-distance dependencies among pixels at varying distances. You can concatenate the outputs of several dilated convolutions, and the model will then learn which distance is important in which case.
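A minimal sketch of the dilated-convolution idea, assuming Keras: parallel 3x3 convolutions with different dilation rates whose outputs are concatenated along the channel axis. The helper name dilated_block, the filter count and the rates (1, 2, 4) are assumptions chosen for illustration.

```python
import tensorflow as tf


def dilated_block(x, filters=16, rates=(1, 2, 4)):
    """Parallel 3x3 convolutions with different dilation rates, concatenated."""
    branches = [
        tf.keras.layers.Conv2D(
            filters, 3, padding="same", dilation_rate=rate, activation="relu"
        )(x)
        for rate in rates
    ]
    # Later layers can learn which receptive field size matters for which case.
    return tf.keras.layers.Concatenate(axis=-1)(branches)


# Spatial dimensions are left as None, so the block accepts any input size.
inputs = tf.keras.Input(shape=(None, None, 3))
outputs = dilated_block(inputs)
model = tf.keras.Model(inputs, outputs)
```

With padding="same" each branch keeps the spatial size of its input, so the concatenated output has len(rates) * filters channels.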
It's important to mention which TensorFlow repository you're using. You can definitely achieve this. The idea is to keep the spatial input dimension fixed within a single batch.
But an even better approach is to use the darknet repository from AlexeyAB: https://github.com/AlexeyAB/darknet
Just set random=1 in https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4.cfg [line 1149]. It will train your network with randomly varying spatial dimensions.
One thing you can do is start your training with the AlexeyAB repo with random=1 set, then take the trained weights file to tensorflow for fine-tuning.
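For the TensorFlow-only route, keeping the spatial dimension fixed within a batch while changing it between batches can be driven by a small size schedule. The sketch below is plain Python and makes several assumptions: the function name multiscale_sizes, the 320 to 608 range and the step of 32 (chosen so sizes stay divisible by a total downsampling factor of 32) are illustrative, not taken from any particular repository.

```python
import random


def multiscale_sizes(num_batches, batches_per_size=10, min_size=320,
                     max_size=608, stride=32, seed=None):
    """Yield one (height, width) per batch; the size changes every
    `batches_per_size` batches and is constant in between."""
    rng = random.Random(seed)
    choices = list(range(min_size, max_size + 1, stride))
    size = rng.choice(choices)
    for step in range(num_batches):
        if step > 0 and step % batches_per_size == 0:
            size = rng.choice(choices)  # pick a new size for the next batches
        yield size, size


# Example: print the input size chosen for each of 30 batches.
for step, (height, width) in enumerate(multiscale_sizes(30, seed=0)):
    print(step, height, width)
```

In a training loop, each batch would be resized to the yielded size (for example with tf.image.resize, scaling the box labels accordingly) before the training step. This requires the model to be built with unspecified spatial input dimensions.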

Deep learning training with nonidentical images?

I am reconstructing some images using dual photography. Next, I want to train a network (a denoising autoencoder) to reconstruct clear images by removing noise.
The inputs for training the network are the reconstructed images, whereas the outputs are ground truth, i.e. computer-based standard test images. The input, e.g. Lena, is not an exact version of Lena: the image is shifted in position and contains some artifacts.
If I keep my reconstructed image as the input and the Lena test image (computer standard test image) as the training output, will it work?
I only want to know whether it would work if the input and output are shifted relative to each other, or if some details are missing in one of them (due to some cropping).
It depends on many factors like your images for training and the architecture of the network.
However, what you want to do is build a network that learns the noise or low-level information, and for this purpose Generative Adversarial Networks (GANs) are very popular. You can read about them here. If you have tried your approach and the results are not satisfactory, then try using GANs, such as DCGAN (Deep Convolutional GAN).
Also, share your outcomes with the community if you would like.
Denoising Autoencoders! Love it!
There is no reason not to train your model with those images. The autoencoder, if well trained, will eventually learn the transformation if there is enough data.
However, if you have the 'positive' images, I strongly recommend creating your own noisy images and then training in that controlled setting. You will simplify your problem and it will be easier to solve.
What is stopping you from doing just that?
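Creating your own noisy images from the clean ones could be sketched as below with NumPy. The noise model is an assumption: additive Gaussian noise plus an optional random shift stands in for the real artifacts of the dual photography reconstruction, which may look quite different. The function name make_noisy_pair and its parameters are illustrative.

```python
import numpy as np


def make_noisy_pair(clean, noise_std=0.1, max_shift=0, rng=None):
    """Return (noisy_input, clean_target) for a float image with values in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    # Additive Gaussian noise (assumed noise model).
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)
    if max_shift > 0:
        # Random shift to mimic misalignment; np.roll wraps around at the borders.
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        noisy = np.roll(noisy, shift=(int(dy), int(dx)), axis=(0, 1))
    return np.clip(noisy, 0.0, 1.0), clean


rng = np.random.default_rng(0)
clean_image = rng.random((32, 32))  # stand-in for a real test image
noisy_image, target = make_noisy_pair(clean_image, noise_std=0.1, max_shift=2, rng=rng)
```

Because the corruption is generated, you control how strong the noise and the shift are, and can start with an easy setting before moving toward the real reconstructions.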

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

I'm currently working on a university project in which we use python, tensorflow and keras to train an image object detector that detects different parts of the root system of Arabidopsis.
Our current results are pretty bad, as we only have about 100 images to train the model with at the moment, but we are cultivating more plants in order to get more images (more data) to train the tensorflow model.
We have implemented the following Mask_RCNN model: Github - Mask_RCNN tensorflow
We want to detect three object classes: stem, main root and secondary root.
But the model incorrectly detects main roots where the secondary roots are located.
It should be able to detect something like this: Root detection example
The training root data set that we are using right now: training images
What sample size is usually needed to train a neural network to accurate results?
First off: I think there is no simple rule for estimating the sample size, but it depends at least on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use them, to reduce the "problem complexity". In some projects where I worked with biological data, background removal (image minus low-pass-filtered image) was the key to getting better results. But you should definitely remove/crop the area outside your region of interest (like the tape and the ruler). I would try to get the cleanest data set possible (including manual adjustments with cv2, gimp, etc.) to focus the network on solving "the right problem". After that you could apply some random distortion to make it work on fuzzy/bad/realistic images as well.
2. The way you work with your data
There are a few tricks that enable you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small GPUs) and gives your network more "variety". Just think about the conv2d task: if you don't use random cropping, your filters will slide over the same areas of the same image over and over again. For the same reason, apply random distortion, flips and rotations to your images.
3. Network architecture
In your case I would prefer a U-Net architecture with a final conv2d output of 3 feature maps (your classes), a final softmax activation and a categorical_crossentropy loss. This lets you play with the depth: sometimes you need a sophisticated architecture to solve a problem (close to 100%), but in your case you just want to see a first working result, so fewer layers and a simple architecture could also help you get things working. Maybe there are trained network weights for a U-Net that meets your requirements (search on kaggle, for example), because it also helps (to reduce the data you need) to use "transfer learning": reuse the first layers (weights) of a network that is already trained. In semantic segmentation, the first filters become something like edge detectors for most problems/images.
4. Your mental model of "accurate results"
This is the hardest part, because it evolves during your project. E.g., the moment your network starts to perform well on preprocessed input images, you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem, but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.
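The generator from point 2, which crops random small patches and applies flips and rotations, could be sketched like this with NumPy. It assumes a segmentation setup where every image has a label mask of the same height and width; the function name random_patches and the patch size are illustrative.

```python
import numpy as np


def random_patches(image, mask, patch_size=64, rng=None):
    """Endlessly yield augmented (image_patch, mask_patch) pairs.

    The same crop, flip and rotation are applied to the image and its mask,
    so the labels stay aligned with the pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    while True:
        top = int(rng.integers(0, height - patch_size + 1))
        left = int(rng.integers(0, width - patch_size + 1))
        img = image[top:top + patch_size, left:left + patch_size]
        msk = mask[top:top + patch_size, left:left + patch_size]
        if rng.random() < 0.5:  # random horizontal flip
            img, msk = img[:, ::-1], msk[:, ::-1]
        k = int(rng.integers(0, 4))  # random rotation by k * 90 degrees
        img, msk = np.rot90(img, k), np.rot90(msk, k)
        yield img.copy(), msk.copy()


rng = np.random.default_rng(0)
image = rng.random((100, 120, 3))  # stand-in for a root photograph
mask = image[..., 0]               # stand-in label mask with the same height/width
patch_generator = random_patches(image, mask, patch_size=32, rng=rng)
image_patch, mask_patch = next(patch_generator)
```

Square patches keep their shape under 90-degree rotations, which keeps batching simple. Other distortions (brightness, blur, arbitrary rotation angles) would follow the same pattern.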

How can I enrich a Convolutional Neural Network with meta information?

I would very much like to understand how I can enrich a CNN with provided meta information. As I understand it, a CNN 'just' looks at the images and classifies them into objects without looking at any existing meta-parameters such as time, weather conditions, etc.
To be more precise, I am using a keras CNN with tensorflow in the backend. I have the typical Conv2D and MaxPooling Layers and a fully connected model at the end of the pipeline. It works nicely and gives me a good accuracy. However, I do have additional meta information for each image (the manufacturer of the camera with which the image was taken) that is unused so far.
What is a recommended way to incorporate this meta information into the model? I have not yet come up with a good solution myself.
Thanks for any help!
Usually this is done by adding the information to one of the fully connected layers before the prediction. The fully connected layer gives you K features representing your image, and you just concatenate them with the additional information you have.
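With the Keras functional API, the concatenation could look roughly like the sketch below. The sizes are assumptions for illustration: 5 classes, 4 camera manufacturers encoded as one-hot vectors, 64x64 RGB inputs and K = 64 image features.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5        # hypothetical number of object classes
NUM_MANUFACTURERS = 4  # hypothetical number of camera manufacturers

image_in = tf.keras.Input(shape=(64, 64, 3), name="image")
meta_in = tf.keras.Input(shape=(NUM_MANUFACTURERS,), name="manufacturer_one_hot")

# Usual convolutional pipeline on the image.
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(image_in)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
features = tf.keras.layers.Dense(64, activation="relu")(x)  # K = 64 image features

# Concatenate the image features with the meta information.
merged = tf.keras.layers.Concatenate()([features, meta_in])
merged = tf.keras.layers.Dense(32, activation="relu")(merged)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = tf.keras.Model(inputs=[image_in, meta_in], outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Example call with random images and two manufacturer ids.
images = np.random.rand(2, 64, 64, 3).astype("float32")
manufacturer_ids = np.array([0, 3])
meta = tf.keras.utils.to_categorical(manufacturer_ids, num_classes=NUM_MANUFACTURERS)
probs = model.predict([images, meta], verbose=0)
```

A categorical value such as the manufacturer is one-hot encoded here so the network does not read an ordering into the ids. Numeric meta information (e.g. time) could be fed in directly, ideally scaled to a range similar to the image features.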

Process to build our own model for image detection

Currently, I am working on a deep neural network for image detection. I found a model called the YOLO network, which is very powerful for object detection, but I have a question:
How can we design and conceive our own model? Do we use brute force for that? For example, "I use 2 convolutional layers, 1 pooling layer and 1 fully connected layer", and then, if the result isn't good, I change the number of layers and the parameters until I find the best model. If anyone knows anything about this, please show me how.
I use Tensorflow.
There are a couple of papers addressing this issue. For example, http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf mentions some general principles, like preserving information by avoiding overly rapid changes in representation size in any cut of the graph separating the output from the input.
Another paper is https://arxiv.org/pdf/1606.02228.pdf where specific hyperparameter combinations are tried.
The rest is what you observe in practice, and it depends on your dataset and your requirements. Maybe you have performance requirements because you want to deploy to mobile, or you need more than 90% accuracy. Then you will have to choose your model accordingly.
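The brute-force approach from the question, trying combinations and keeping the best one, can at least be made systematic. The sketch below is plain Python and is not taken from either paper: the function name grid_search, the search space and the scoring function are all hypothetical. In practice, evaluate would build and train a model for the given configuration and return its validation accuracy.

```python
import itertools


def grid_search(search_space, evaluate):
    """Try every combination in `search_space`; return the best (config, score)."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # e.g. validation accuracy of a trained model
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space and a toy scoring function standing in for training.
space = {"num_conv_layers": [1, 2, 3], "num_dense_layers": [1, 2]}


def toy_score(config):
    return -abs(config["num_conv_layers"] - 2) - abs(config["num_dense_layers"] - 1)


best_config, best_score = grid_search(space, toy_score)
```

The number of combinations grows quickly with every added option, which is one reason to narrow the search space first with design principles such as those in the papers above.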