How many training samples should I take for an object detection model with 62 classes? - tensorflow

I'm trying to train a YOLOv3 model for 62 classes using https://github.com/wizyoung/YOLOv3_TensorFlow.
How many samples should I take for each class?
I'm using an Nvidia GTX 1050 Ti GPU, so what should my batch size be with images of size 300x300?
Is an 80-20 train/test split ideal?

The 80-20% train/test (validation) split depends on the number of samples, not on the number of classes. The more data you have, the more skewed the split can be; with millions of samples you can use a 95%-5% split.
As a rule of thumb, you should have at least 200 bounding-box annotations per class, i.e. each of your 62 classes should appear in at least 200 annotated boxes.
The 1050 Ti has only 4 GB of VRAM. Depending on your image size you can increase or decrease the batch size, but keep in mind that you do not have much VRAM available: for 300x300 images a batch size of 2 is most likely the maximum you can achieve (decrease it to 1 if you hit OOM errors).
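If it helps, here is a minimal sketch of an 80-20 split using scikit-learn's train_test_split (the file paths are illustrative placeholders, not from the YOLOv3 repo):

from sklearn.model_selection import train_test_split

# Illustrative placeholders: parallel lists of image files and their annotation files
image_paths = ["images/img_%04d.jpg" % i for i in range(1000)]
annotation_paths = ["labels/img_%04d.txt" % i for i in range(1000)]

# 80% train, 20% validation; fix random_state so the split is reproducible
train_imgs, val_imgs, train_anns, val_anns = train_test_split(
    image_paths, annotation_paths, test_size=0.2, random_state=42)

print(len(train_imgs), "training images,", len(val_imgs), "validation images")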

Related

train and test set for a ML algorithm

I have a model trained on 33 datasets with an SVM using LOOCV. I collected another 13 datasets, which I split in a leave-one-out fashion. In the testing phase I combine the 33 training datasets with 12 of the new ones, train a model on those 45 datasets, and test on the remaining dataset, iterating over all 13 (similar to LOOCV). Is this method of testing right? All the recordings are independent of each other and can be regarded as IID.
No. LOOCV is normally used only for small datasets, or when you need an accurate estimate of your model's performance.
Let's say your train accuracy is 90%; your test accuracy may be 50%.
This is due to overfitting from the large train size and small test size.
[Image: illustration of overfitting in ML models]
Assuming your 45 datasets are the same size, your train/test split will be roughly 98%-2%.
The general rule of thumb for the train/test split is 80%-20%.
You could use train_test_split, KFold, StratifiedShuffleSplit, etc. instead.
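For example, a minimal sketch of a 5-fold alternative with scikit-learn (the features and labels below are synthetic stand-ins for your 46 datasets):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# One feature vector per dataset and a binary label; synthetic placeholders here
X = np.random.rand(46, 10)
y = np.random.randint(0, 2, size=46)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print("mean accuracy over 5 folds:", np.mean(scores))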

TensorFlow image classification colab sheet from training material: newbie questions

Apologies if my questions are relatively simple, but I have recently been approaching TensorFlow with the aim of learning new skills.
In the example there are several things I can't figure out:
In the "explore the data" section, the dataset sizes are reported as 60k/10k respectively for train and test.
Where is the size of the train/test split declared?
Packages like scikit-learn allow this to be specified as a percentage when invoking the split methods.
In the model-training part, when the 5 epochs are trained, the number 1875 appears below.
- What is that?
- I was expecting the training to run over the 60k items, but even multiplying 1875 by 5 the number doesn't reach the 10k.
The dataset is loaded using the TensorFlow Datasets API.
The source itself has the split of 60k (train) and 10k (test):
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An epoch is a complete run over all the training samples. The training is done in batches; in the example you refer to, a batch size of 32 is used, so completing one epoch takes 1875 batches (60000 / 32).
Hope this helps.
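A minimal sketch that reproduces those numbers with tf.keras (the model architecture below is illustrative, not necessarily the one in the colab):

import tensorflow as tf

# 60,000 training and 10,000 test images, 28x28 grayscale
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# The default batch_size is 32, so each epoch shows 60000 / 32 = 1875 steps
model.fit(x_train, y_train, epochs=5)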

Time taken to train Resnet on CIFAR-10

I was writing a neural net to train a ResNet on the CIFAR-10 dataset.
The paper Deep Residual Learning for Image Recognition mentions training for around 60,000 epochs.
I was wondering: what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50,000-image training set)?
Also, how long is this expected to take to train (assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours, which is equivalent to about 10 passes through the 50,000-image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.
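As a quick sanity check, the iteration-to-epoch arithmetic implied by the paper (64,000 iterations at batch size 128 over 50,000 training images):

# Convert the paper's iteration count into full passes over the training set
iterations = 64000
batch_size = 128
train_size = 50000

epochs = iterations * batch_size / train_size
print("%.0f epochs" % epochs)  # ~164 passes over the dataset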

GPU temperature reads 88 C when training a LSTM on tensorflow

I've got a 1-layer LSTM model in TensorFlow, and the temperature reading of my GPU gets rather high during the training phase, always varying between 80 C and 90 C. My GPU is a water-cooled GTX 1080 "Super-clocked" edition in a 24/7 refrigerated room. The model works, but this temperature worries me. I'd like to know whether this is normal and safe.
I'm training the LSTM for a next-word-prediction problem with tokenized Reddit comments. I got the idea from different tutorials on wildml.com. Here are some details about it:
TensorFlow 1.2.1, CUDA toolkit 8.0, cuDNN 6.0, Nvidia driver 375.66
My training data consists of 200k Reddit comments.
My word dictionary consists of 8000 words, which means 8000 output classes for each prediction.
I use GloVe pre-trained 100-dimensional embeddings of Wikipedia words.
I'm not using placeholders to feed my input. It's all done with TFRecord file readers, which feed the examples into a 100k-capacity random shuffle queue.
From the random shuffle queue they go into a padding FIFO queue, where I generate zero-padded mini-batches of 20.
The size-20 mini-batches go into tf.nn.dynamic_rnn() with an LSTM cell with a hidden dimension of 150.
I mask the losses using tf.sign() and minimize the result with the Adam optimizer; a rough sketch of this part follows below.
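Roughly, the model and masking part looks like this (simplified; the queue plumbing is replaced by placeholders and the variable names are illustrative):

import tensorflow as tf

VOCAB_SIZE = 8000    # word dictionary size
EMBED_DIM = 100      # GloVe embedding dimension
HIDDEN_DIM = 150     # LSTM hidden dimension

# inputs and labels are the zero-padded [batch, max_time] int32 tensors
# dequeued from the padding FIFO queue (shown as placeholders for brevity)
inputs = tf.placeholder(tf.int32, [None, None])
labels = tf.placeholder(tf.int32, [None, None])

embeddings = tf.get_variable("embeddings", [VOCAB_SIZE, EMBED_DIM])
inputs_emb = tf.nn.embedding_lookup(embeddings, inputs)

cell = tf.contrib.rnn.BasicLSTMCell(HIDDEN_DIM)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs_emb, dtype=tf.float32)

logits = tf.layers.dense(outputs, VOCAB_SIZE)
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# tf.sign() is 1.0 for real tokens and 0.0 for the zero padding,
# so multiplying zeroes out the loss at padded positions
mask = tf.sign(tf.to_float(labels))
loss = tf.reduce_sum(losses * mask) / tf.reduce_sum(mask)

train_op = tf.train.AdamOptimizer().minimize(loss)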
I've noticed that the temperature rises a lot when I raise the mini-batch size. With mini-batches of size 1 (single examples), it reads between 72 and 75 C. With mini-batches of 10, it immediately goes to 78 C and stays in the range 78-84 C. With mini-batches of 20, 84-88 C. With mini-batches of 30, 87-92 C.
If I raise the hidden dimension to 200, 250, 300, etc., while keeping the mini-batch size fixed, I get similar temperature rises.
I've also trained the same model feeding the data with placeholders only, i.e. without TFRecords, queues and mini-batches. It stays around 65 C, but it's obviously far from optimized and not ideal to feed the net with placeholders.
I really appreciate your help, I'm kinda desperate, to be honest.
-----------------EDIT---------------------
It turns out the water cooler pump was configured in my BIOS to vary with the CPU temperature... Obviously the GPU temperature wouldn't affect it, and that's what happened: it was running at 50% of its capacity. I've adjusted it to stay at 100% all the time, and now the same model runs with a max temperature of approx. 83 C. Still not perfect, but a huge improvement. I guess that with the complexity of my model plus the really high 1.8 GHz clock of my GPU there's not much more I can do.
The maximum design temperature of the GTX 1080, according to NVIDIA, is 94 C. Anything below that and you should be safe.
Maximum GPU Temperature (in C) 94
The fact that the GPU temperature rises when you raise the mini-batch size is a good sign: it means your GPU is working as hard as it can. In fact, if your GPU is not at ~80-90 C, it is not working at full power and you are losing some performance.

Google TensorFlow based seq2seq model crashes while training

I have been trying to use Google's RNN-based seq2seq model.
I have been training a model for text summarization and am feeding in textual data of approximately 1 GB. The model quickly fills up my entire RAM (8 GB), starts filling up even the swap memory (a further 8 GB), and crashes, after which I have to do a hard shutdown.
The configuration of my seq2seq network is as follows:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerDot
  attention.params:
    num_units: 128
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it still did not help. What specific changes should I make to prevent my model from consuming all the RAM and crashing? (e.g. decreasing the data size, decreasing the number of stacked RNN cells, further decreasing the batch size, etc.)
My system runs Python 2.7, TensorFlow 1.1.0, and CUDA 8.0. It has an Nvidia GeForce GTX 1050 Ti (768 CUDA cores) with 4 GB of memory, plus 8 GB of RAM and a further 8 GB of swap.
Your model looks pretty small; the only thing that is kind of big is the training data. Check that your get_batch() function has no bugs: it is possible that on each batch you are actually loading the whole dataset.
To quickly test this, cut your training data down to something very small (say 1/10 of the current size) and see if that helps. It should not matter, because you are using mini-batches, but if it does resolve the problem, fix your get_batch() function.
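For reference, a minimal sketch of a get_batch() that streams mini-batches lazily instead of holding the whole corpus in memory (the file name and tab-separated source/target format are illustrative assumptions):

def get_batch(path, batch_size=16):
    """Yield (source, target) mini-batches, reading the corpus line by line."""
    sources, targets = [], []
    with open(path) as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t")   # assumes tab-separated pairs
            sources.append(src.split())
            targets.append(tgt.split())
            if len(sources) == batch_size:
                yield sources, targets
                sources, targets = [], []
    if sources:                                        # last partial batch
        yield sources, targets

# Usage: nothing beyond one mini-batch is ever kept in memory
for src_batch, tgt_batch in get_batch("train.tsv", batch_size=16):
    pass  # feed the batch to the training step here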