Image Classification: Heavily unbalanced data over thousands of classes - tensorflow

I have a dataset consist of around 5000 categories of images, but the number of images of every category varies from 20 to 2000, which is quite unbalanced. Also, the number of images are far from enough to train a model from scratch. I decided to do finetuning on pretrained models, like Inception models.
But I am not sure about how to deal with unbalanced data. There are several possible approaches:
Oversampling: Oversample the minority category. But even with aggressive image augmentation technique, we may not be able to deal with overfit.
Also, how to generate balanced batches from unbalanced dataset over so many categories? Do you have some ideas about this pipeline mechanism with TensorFlow?
SMOTE: I think it is not so effective for high dimensional signals like images.
Put weight on cross entropy loss in every batch. This might be useful for single batch, but cannot deal with the overall unbalance.
Any ideas about this? Any feedback will be appreciated.

Use tf.losses.softmax_cross_entropy and set weights for each class inversely proportional to their training frequency to "balance" the optimization.

Start with the pre-trained ImageNet layers, add your own final layers (with appropriate convolution, drop out and flatten layers as required). Freeze all but last few of the ImageNet layers, then train on your dataset.
For unbalanced data (and in general small datasets), use data augmentation to create more training images. Keras has this functionality built-in: Building powerful image classification models using very little data

Related

How to control the amount of the augmented training data with the augmentation layers integrated into the model

As you may know, recent versions of tensorflow/keras allowed the data augmentation layers integrated into the model. This feature of the API is an excellent option, especially when you want to apply image augmentation on a part of inputs (image) for a model with multimodal inputs and different sub-networks for different inputs. And the test accuracy with this augmentation increased to 3-5% in comparison with no augmentation.
But I can't figure out how many training samples were used in the actual training with this augmentation method. For simplicity, let's assume I am passing a list of numpy arrays as the inputs of the model when fitting the model. For example, if I have 1000 training cases for a model with the augmentation layers, will 1000 training cases with transformed images be used in training? If not, how many?
I tried to search all related sites (tutorials and documentation) for an answer to this simple question in vain.
I think I found the answer. Based on the training log of the model, the augmentation layers do not produce additional images but randomly transform the original images. To increase generated data amount, a user has to provide multiple copies of original training data as input to the model.

Change the spatial input dimension during training

I am training a yolov4 (fully convolutional) in tensorflow 2.3.0.
I would like to change the spatial input shape of the network during training, to further adjust the weights to different scales.
Is this possible?
EDIT:
I know of the existence of darknet, but it suffers from some very specific augmentations I use and have implemented in my repo, that is why I ask explicitly for tensorflow.
To be more precisely about what I want to do.
I want to train for several batches at Y1xX1xC then change the input size to Y2xX2xC and train again for several batches and so on.
It is not possible. In the past people trained several networks for different scales but the current state-of-the-art approach is feature pyramids.
https://arxiv.org/pdf/1612.03144.pdf
Another great candidate is to use dilated convolution which can learn long distance dependencies among pixels with varying distance. You can concatenate the outputs of them and the model will then learn which distance is important for which case
https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5
It's important to mention which TensorFlow repository you're using. You can definitely achieve this. The idea is to keep the fixed spatial input dimension in a single batch.
But even better approach is to use the darknet repository from AlexeyAB: https://github.com/AlexeyAB/darknet
Just set, random = 1 https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4.cfg [line 1149]. It will train your network with different spatial dimensions randomly.
One thing you can do is, start your training with AlexeyAB repo with random=1 set, then take the trained weights file to tensorflow for fine-tuning.

BERT + custom layer training performance going down with epochs

I'm training a classification model with custom layers on top of BERT. During this, the training performance of this model is going down with increasing epochs ( after the first epoch ) .. I'm not sure what to fix here - is it the model or the data?
( for the data it's binary labels, and balanced in the number of data points for each label).
Any quick pointers on what the problem could be? Has anyone come across this before?
Edit: Turns out there was a mismatch in the transformers library and tf version I was using. Once I fixed that, the training performance was fine!
Thanks!
Remember that fine-tuning a pre-trained model like Bert usually requires a much smaller number of epochs than models trained from scratch. In fact the authors of Bert recommend between 2 and 4 epochs. Further training often translates to overfitting to your data and forgetting the pre-trained weights (see catastrophic forgetting).
In my experience, this affects small datasets especially as it's easy to overfit on them, even at the 2nd epoch. Besides, you haven't commented on your custom layers on top of Bert, but adding much complexity there might increase overfitting also -- note that the common architecture for text classification only adds a linear transformation.

Number of images in each classes

Currently, I am training a model for image detection, and I want to know how many images do I need per class, do i need to have the same numbers of each object.
Please i need some advice.
I use Tensorflow, and Yolo v2 model.
Thanks,
You need as many as you can get, but definitely in the order of tens of thousands at least if you're training the network from scratch (there are pre-trained weights for YOLOv2 trained on
- http://host.robots.ox.ac.uk/pascal/VOC/).
It's best to have balanced classes, meaning the number of images for each class must be close, it's easier to train this way.
Why are you training the network yourself? Can't you use some pre-trained models, drop the FC layers and insert your own classes? This way it's much faster, and you don't need that many images.

How filters are initialized in convnet

I read a lot of papers on convnets, but there is one thing I don't understand, how the filters in convolutional layer are initialized ?
Because, for examples, in first layer, filters should detect edges etc..
But if it randomly init, it could not be accurate ? Same for next layer and high-level features.
And an other question, what are the range of the value in those filters ?
Many thanks to you!
You can either initialize the filters randomly or pretrain them on some other data set.
Some references:
http://deeplearning.net/tutorial/lenet.html:
Notice that a randomly initialized filter acts very much like an edge
detector!
Note that we use the same weight initialization formula as with the
MLP. Weights are sampled randomly from a uniform distribution in the
range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a
hidden unit. For MLPs, this was the number of units in the layer
below. For CNNs however, we have to take into account the number of
input feature maps and the size of the receptive fields.
http://cs231n.github.io/transfer-learning/ :
Transfer Learning
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs
are the 1000 class scores for a different task like ImageNet), then
treat the rest of the ConvNet as a fixed feature extractor for the new
dataset. In an AlexNet, this would compute a 4096-D vector for every
image that contains the activations of the hidden layer immediately
before the classifier. We call these features CNN codes. It is
important for performance that these codes are ReLUd (i.e. thresholded
at zero) if they were also thresholded during the training of the
ConvNet on ImageNet (as is usually the case). Once you extract the
4096-D codes for all images, train a linear classifier (e.g. Linear
SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new
dataset, but to also fine-tune the weights of the pretrained network
by continuing the backpropagation. It is possible to fine-tune all the
layers of the ConvNet, or it's possible to keep some of the earlier
layers fixed (due to overfitting concerns) and only fine-tune some
higher-level portion of the network. This is motivated by the
observation that the earlier features of a ConvNet contain more
generic features (e.g. edge detectors or color blob detectors) that
should be useful to many tasks, but later layers of the ConvNet
becomes progressively more specific to the details of the classes
contained in the original dataset. In case of ImageNet for example,
which contains many dog breeds, a significant portion of the
representational power of the ConvNet may be devoted to features that
are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release
their final ConvNet checkpoints for the benefit of others who can use
the networks for fine-tuning. For example, the Caffe library has a
Model Zoo where people
share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of
several factors, but the two most important ones are the size of the
new dataset (small or big), and its similarity to the original dataset
(e.g. ImageNet-like in terms of the content of images and the classes,
or very different, such as microscope images). Keeping in mind that
ConvNet features are more generic in early layers and more
original-dataset-specific in later layers, here are some common rules
of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns. Since the data is similar to the original data,
we expect higher-level features in the ConvNet to be relevant to this
dataset as well. Hence, the best idea might be to train a linear
classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won't overfit
if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a
linear classifier. Since the dataset is very different, it might not
be best to train the classifier form the top of the network, which
contains more dataset-specific features. Instead, it might work better
to train the SVM classifier from activations somewhere earlier in the
network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can
afford to train a ConvNet from scratch. However, in practice it is
very often still beneficial to initialize with weights from a
pretrained model. In this case, we would have enough data and
confidence to fine-tune through the entire network.
Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the
architecture you can use for your new dataset. For example, you can't
arbitrarily take out Conv layers from the pretrained network. However,
some changes are straight-forward: Due to parameter sharing, you can
easily run a pretrained network on images of different spatial size.
This is clearly evident in the case of Conv/Pool layers because their
forward function is independent of the input volume spatial size (as
long as the strides "fit"). In case of FC layers, this still holds
true because FC layers can be converted to a Convolutional Layer: For
example, in an AlexNet, the final pooling volume before the first FC
layer is of size [6x6x512]. Therefore, the FC layer looking at this
volume is equivalent to having a Convolutional Layer that has
receptive field size 6x6, and is applied with padding of 0.
Learning rates. It's common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the
(randomly-initialized) weights for the new linear classifier that
computes the class scores of your new dataset. This is because we
expect that the ConvNet weights are relatively good, so we don't wish
to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random
initialization).
Additional References
CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features
from ImageNet-pretrained ConvNet and reports several state of the art
results.
DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
How transferable are features in deep neural networks? studies the transfer
learning performance in detail, including some unintuitive findings
about layer co-adaptations.