How to train large datasets in Colab free - tensorflow

I have to train on 70,000 images for my face verification project on Google Colab (free tier).
First it gets stuck on the 1st epoch, and even when training does start, it throws an out-of-RAM error after some time.
The code I use is:
<https://nbviewer.org/github/nicknochnack/FaceRecognition/blob/main/Facial%20Verification%20with%20a%20Siamese%20Network%20-%20Final.ipynb>
If I have to split my dataset into mini-batches so that it fits in Colab's GPU memory, how can I do that?
Also, I want to train on the whole dataset because it contains images of 5 different people as anchors and positives.

You can try the following options to train larger datasets.
Add more pooling layers to the model.
Lower the input size of your model.
Store the images in a binary format (e.g. TFRecord) at a lower resolution for image classification models.
Lower the batch size while training and validating your model.
You can also use the tf.data API for operations such as batching, slicing, preprocessing and shuffling to build an input pipeline, and you can constrain GPU memory usage further to avoid out-of-memory issues (see the sketch below).
A sample Colab notebook is linked here: https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb
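For illustration, here is a minimal tf.data sketch for the face-verification case in the question, assuming the anchor and positive images sit in two folders whose files pair up in sorted order; the folder paths, image size and batch size are placeholders to adapt, not values taken from the linked notebook.

import tensorflow as tf

# Optional: let the GPU allocate memory gradually instead of grabbing it all at once.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

IMG_SIZE = (100, 100)   # assumed target size; lower it further if memory is tight
BATCH_SIZE = 16         # small batches help fit the free Colab GPU

def load_image(path):
    # Read one image file from disk and resize it on the fly, so the full
    # 70k-image dataset never has to sit in RAM at once.
    img = tf.io.read_file(path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    return img / 255.0

# Hypothetical directory layout; point these globs at your anchor/positive folders.
anchors = tf.data.Dataset.list_files('data/anchor/*.jpg', shuffle=False)
positives = tf.data.Dataset.list_files('data/positive/*.jpg', shuffle=False)

pairs = (tf.data.Dataset.zip((anchors, positives))
         .shuffle(1024)                                  # shuffle file paths, cheap on RAM
         .map(lambda a, p: (load_image(a), load_image(p)),
              num_parallel_calls=tf.data.AUTOTUNE)
         .batch(BATCH_SIZE)
         .prefetch(tf.data.AUTOTUNE))

# model.fit(pairs, epochs=10)   # e.g. the Siamese model from the linked notebook

Because only the file paths are held in memory and images are decoded per batch, the dataset size is no longer limited by Colab's RAM, only by disk space.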

Related

Training SSD-MOBILENET V1 and the loss does not decrease

I'm new to everything about CNNs and TensorFlow. I have been training a pretrained model with ssd_mobilenet_v1_pets.config to detect columns of buildings for about a day, but the loss has stayed between 1 and 2 and has not decreased for the last 10 hours.
I realized that my input images are 128x128 and SSD resizes the images to 300x300.
Does the size of the input images affect the training?
If so, should I retrain the network with larger input images? Or what would be another option to decrease the loss? My training dataset has 660 images and the test set has 166. I don't know if those are enough images.
I really appreciate your help.
Loss values of ssd_mobilenet can be different from faster_rcnn. From EdjeElectronics' TensorFlow Object Detection Tutorial:
For my training on the Faster-RCNN-Inception-V2 model, it started at about 3.0 and quickly dropped below 0.8. I recommend allowing your model to train until the loss consistently drops below 0.05, which will take about 40,000 steps, or about 2 hours (depending on how powerful your CPU and GPU are). Note: The loss numbers will be different if a different model is used. MobileNet-SSD starts with a loss of about 20, and should be trained until the loss is consistently under 2.
For more information: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10#6-run-the-training
The SSD Mobilnet architecture demands additional training to suffice the loss accuracy values of the R-CNN model, however, offers practicality, scalability, and easy accessibility on smaller devices which reveals the SSD model as a promising candidate for further assessment (Fleury and Fleury, 2018).
For more information: Fleury, D. & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy. MDPI AG.
I would recommend using 15%-20% of your images for testing, making sure they cover all the variety present in the training data. As you said, you have 650+ images for training and 150+ for testing, which is roughly 25% as many test images as training images. It looks like you have enough images to start with. I know the more, the merrier, but make sure your model also has sufficient data to learn from!
Resizing the images does not contribute to the loss. It ensures consistency across all images so the model can recognize them without bias. The loss has nothing to do with image resizing, as long as every image is resized identically.
You have to stop and restore from checkpoints repeatedly if you want your model to fit well. Usually you can get good accuracy by retraining the SSD MobileNet until the loss stays consistently under 1. Ideally we want the loss to be as low as possible, but we also want to make sure the model is not over-fitting. It is all about trial and error. (A loss between 0.5 and 1 seems to do the job well, but again it all depends on you.)
The reason I think your model is underperforming is that your test data contains a lot of variety and your training data is not large enough to cover it.
The model has not been given enough knowledge in the training data to generalize to the new variety in the test data. (For example: your test data has images of buildings from angles that are not sufficiently represented in the training data.) In that case, I recommend putting the full variety of images into the training data and then picking test images while making sure you still have sufficient training examples of the new viewpoints; a sketch of such a split follows below. That's why I recommend a 15%-20% test split.
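To make the 15%-20% split concrete, here is a small sketch using scikit-learn's train_test_split; the file names and labels are dummy stand-ins, not your actual data.

from sklearn.model_selection import train_test_split

# Stand-in data: replace with your real image paths and their labels.
image_paths = [f"img_{i}.jpg" for i in range(826)]
labels = ["column" if i % 2 else "no_column" for i in range(826)]

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.2,        # keep ~20% aside for testing
    stratify=labels,      # preserve the label mix in both splits
    random_state=42)      # reproducible split

print(len(train_paths), "training images,", len(test_paths), "test images")

The stratify argument is what keeps the variety balanced: each split receives the same proportion of every label (or viewpoint group, if you encode that as the label).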

Faster RCNN + inception v2 input size

What is the input size of the Faster R-CNN RPN?
I'm using the TensorFlow Object Detection API with Faster R-CNN, which includes a region proposal network (RPN), and Inception as the feature extractor (according to the config file). The API detects every input image individually in the prediction phase. However, I'm now trying to feed images to the network in batches using the TensorFlow Dataset API.
As you know, to make a batch out of the data we first need to resize all of the images to the same size. I think the best way to resize the images is to resize them exactly to the input size of the Faster R-CNN, to avoid resizing twice. So my question is: what is the input size of the Faster R-CNN RPN?
Thanks in advance.
It depends on the input resolution specified in the pipeline config file, under image_resizer.
For example, for Faster R-CNN over InceptionV2 trained on the COCO dataset, see this config file.
The specified resolution is 600x1024.
On a side note, fully convolutional architectures (such as R-FCN, SSD, YOLO) are not restricted to a single resolution, i.e. you can apply them to different input resolutions without modifying the architecture.
But this does not mean that the model will be robust to it if you trained on a single resolution.
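As a rough illustration of the batching side (this is not the Object Detection API's own preprocessing): given the 600x1024 resolution mentioned above, one way to batch with the Dataset API is to pad-resize every image to a fixed 600x1024 before stacking, which also preserves aspect ratio; the glob pattern and batch size below are placeholders.

import tensorflow as tf

TARGET_H, TARGET_W = 600, 1024   # matches the resolution mentioned above

def load_and_resize(path):
    # Decode one image and pad-resize it to a fixed shape so that all
    # images in a batch can be stacked into a single tensor.
    img = tf.io.read_file(path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize_with_pad(img, TARGET_H, TARGET_W)
    return tf.cast(img, tf.uint8)

# Hypothetical glob; point it at your own images.
ds = (tf.data.Dataset.list_files('images/*.jpg')
      .map(load_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(8)
      .prefetch(tf.data.AUTOTUNE))

# Each element is now an [8, 600, 1024, 3] uint8 batch ready to feed the detector.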

Tensorflow object detection: why is the location in the image affecting detection accuracy when using ssd mobilenet v1?

I'm training a model to detect meteors within a picture of the night sky, and I have a fairly small dataset of about 85 images, each annotated with a bounding box. I'm using transfer learning starting from the ssd_mobilenet_v1_coco_11_06_2017 checkpoint and Tensorflow 1.4. I'm resizing images to 600x600 pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally and vertically and rotate them 90 degrees. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors, but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this? [Image: detected meteors in image example]
It could very well be your data and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable approximation of generic model accuracy.
At the highest level, the choice of model is largely a tradeoff between speed and accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would recommend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a significant amount of time preprocessing images.

Retraining Inception and Downsampling

I have followed this Tensorflow tutorial on transfer learning with the Inception model using my own dataset of 640x360 images. My question comes in 2 parts:
1) My dataset contains 640x360 images. Is the first operation that happens a downsampling to 299x299? I ask because I have a higher-resolution version of the same dataset, and I am wondering if training with the higher-resolution images will result in different (hopefully better) performance.
2) When running the network (using tf.sess.run()), is my input image downsampled to 299x299?
Note: I have seen the 299x299 resolution stat listed in many places online, like this one, and I am confused about exactly which images it refers to: the initial training dataset images (for Inception I think it was ImageNet), the transfer learning dataset (my personal one), the input image when running the CNN, or a combination of the 3.
Thanks in advance :)
The Inception model will resize your image to 299x299. This can be confirmed by visualizing the TensorFlow graph. If you have enough samples to do the transfer learning, the accuracy will be good enough with resizing to 299x299. But if you really want to try training at the actual resolution, the initial input layers of the graph need to be changed to the new size.
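As an illustration of where the 299x299 figure applies (using the Keras InceptionV3 bundled with current TensorFlow rather than the retrain script from the tutorial), this sketch downsamples a 640x360 frame to 299x299 before extracting bottleneck features for transfer learning; the random frame is just a stand-in for one of your images.

import numpy as np
import tensorflow as tf

# InceptionV3's default input resolution is 299x299x3.
base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights='imagenet',
                                         pooling='avg')

# Stand-in for one of your 640x360 frames.
frame = np.random.randint(0, 256, size=(360, 640, 3)).astype(np.float32)

# Downsample to the resolution the network was built for.
resized = tf.image.resize(frame, (299, 299))
batch = tf.keras.applications.inception_v3.preprocess_input(resized[tf.newaxis, ...])

features = base(batch)   # 2048-d bottleneck features for transfer learning
print(features.shape)    # (1, 2048)

In other words, the 299x299 figure refers to what the network itself consumes at train and inference time; your dataset can be stored at any resolution, since it gets resized on the way in.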

Tensorflow out of memory

I am using tensorflow to build CNN based text classification. Some of the datasets are large and some are small.
I use feed_dict to feed the network by sampling data from system memory (not GPU memory). The network is trained batch by batch. The batch size is 1024 fixed for every dataset.
My question is:
The network is trained in batches, and for each batch the code retrieves data from system memory. Therefore, no matter how large the dataset is, the code should handle it the same way, right?
But I get an out-of-memory problem with the large datasets, while the small ones work fine. I am pretty sure the system memory is enough to hold all the data. So the OOM problem is about TensorFlow, right?
Is it that I write my code wrong, or is it about tensorflow's memory management?
Thanks a lot!
I think your batch size of 1024 is way too big. There is a lot of matrix overhead created, especially if you use AdaGrad, Adam and the like, dropout, attention and/or more. Try smaller values, like 100, as the batch size, as in the sketch below. That should solve it and train just fine.
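Here is what a smaller feed_dict batch loop can look like, as a sketch written against the TF1-style tf.compat.v1 API, with a dummy dense classifier and random data standing in for your text CNN and dataset.

import numpy as np
import tensorflow.compat.v1 as tf  # TF1-style graph/session API, as in the question
tf.disable_eager_execution()

# Dummy stand-in data; replace with your own text features and labels.
num_samples, num_features, num_classes = 10_000, 300, 2
train_x = np.random.rand(num_samples, num_features).astype(np.float32)
train_y = np.random.randint(0, num_classes, size=num_samples)

# Minimal classifier just to show the feed pattern; substitute your CNN here.
x_ph = tf.placeholder(tf.float32, [None, num_features])
y_ph = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x_ph, num_classes)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

batch_size = 100  # much smaller than 1024
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(2):
        idx = np.random.permutation(num_samples)
        for start in range(0, num_samples, batch_size):
            batch = idx[start:start + batch_size]
            _, batch_loss = sess.run(
                [train_op, loss],
                feed_dict={x_ph: train_x[batch], y_ph: train_y[batch]})
        print("epoch", epoch, "last batch loss", batch_loss)

Only one batch of 100 examples (plus the optimizer's per-variable state) has to fit on the GPU at a time, which is why shrinking the batch size is usually enough to make the OOM go away.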