How do the sizes of the training images, the training network and the inference network affect the accuracy of a YOLOv4 Darknet model?

Our use case is to train a YOLOv4 network to detect an object as small as a wedding band on top of a table, so the object is about 250px by 250px inside a 4096px x 2160px frame.
Training network size: According to the README.md, should our training network be as large as can fit within GPU memory? In our case that is 1056 x 1056 for a Titan T4 with 16GB of GPU RAM.
Training image size: Our training images of the object are around 1000 px by 1000 px. From the Darknet documentation I see that images are resized when random=1, so I assume we are fine with these relatively high-resolution training images?
Detection network size: For detection, the first point of item 2 in the same README.md suggests increasing the network resolution. So if we train at 1056 x 1056, should we use 1280 x 1280 (32 * 40) or an even larger net.width and net.height for detection?
Thanks!
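For reference, a minimal sketch of how the detection-time resolution change could be applied, assuming a standard Darknet cfg with width= and height= lines in its [net] section (the file names and helper below are hypothetical, not part of Darknet):

# Hypothetical helper: copy a training cfg and raise the [net] width/height
# for detection. Darknet network resolutions must be multiples of 32.
def set_net_resolution(src_cfg, dst_cfg, size):
    assert size % 32 == 0, "network resolution must be a multiple of 32"
    with open(src_cfg) as f:
        lines = f.readlines()
    with open(dst_cfg, "w") as f:
        for line in lines:
            key = line.split("=")[0].strip()
            if key in ("width", "height"):
                line = "{}={}\n".format(key, size)
            f.write(line)

# e.g. train at 1056 x 1056, then detect with a larger resolution such as 1280
set_net_resolution("yolov4-custom.cfg", "yolov4-custom-detect.cfg", 1280)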

Related

How many training samples should I take for an object detection model with 62 classes?

I'm trying to train a YOLOv3 model for 62 classes using https://github.com/wizyoung/YOLOv3_TensorFlow.
How many samples should I take for each class?
I'm using an Nvidia GTX 1050 Ti GPU, so what should my batch size be with images of size 300*300?
Is 80-20 train/test split ideal?
The 80-20% train-test(val) split depends on the number of samples, not on the number of classes. The more data you have, the more skewed the split between train and test(val) can be (with millions of samples you can use a 95%-5% split).
Normally, a minimum of 200 bounding-box annotations per object should be present; that is, each of your classes should have at least 200 annotations.
The 1050 Ti has only 4GB of VRAM. Depending on your image_size, you can increase or decrease the batch_size. However, keep in mind that you do not have much VRAM available; most likely a batch_size of 2 for 300x300 images will be the maximum you can achieve (decrease it to 1 if you hit OOM issues).
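As a concrete illustration of the 80-20 advice above, a minimal sketch of splitting a list of annotated images into train and validation sets (the paths and file names here are hypothetical):

import glob
import random

# Hypothetical dataset layout; each image is assumed to have a matching annotation file.
images = sorted(glob.glob("dataset/images/*.jpg"))
random.seed(0)
random.shuffle(images)

split = int(0.8 * len(images))               # 80% train, 20% validation
train_files, val_files = images[:split], images[split:]

with open("train.txt", "w") as f:
    f.write("\n".join(train_files))
with open("val.txt", "w") as f:
    f.write("\n".join(val_files))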

GPU temperature reads 88 C when training an LSTM on TensorFlow

I've got a 1-layer LSTM model in TensorFlow, and the temperature reading of my GPU gets rather high during the training phase, always varying between 80 C and 90 C. My GPU is a water-cooled GTX 1080 "Super-clocked" edition in a 24/7 refrigerated room. The model works, but this temperature worries me. I'd like to know if this is normal and safe.
I'm training the LSTM for a next-word-prediction problem with tokenized reddit comments. I got the idea from different tutorials in wildml.com. Here are some details about it:
TensorFlow 1.2.1, CUDA toolkit 8.0, cuDNN 6.0, Nvidia driver 375.66
My training data consists of 200K reddit comments.
My word dictionary consists of 8000 words, which means 8000 classes for each prediction.
I use GloVe pre-trained 100-dimensional embeddings of Wikipedia words.
I'm not using placeholders to feed my input. It's all done with TFRecord file readers, which feed the examples into a 100k-capacity random shuffle queue.
From the random shuffle queue, they go to a padding FIFO queue, where I generate zero-padded mini-batches of 20.
The size-20 mini-batches go to tf.nn.dynamic_rnn() with an LSTM cell with a hidden dimension of 150.
I mask the losses using tf.sign() and minimize the result with the Adam optimizer, as sketched below.
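A minimal sketch (not the poster's actual code) of the loss-masking step, assuming TF 1.x; the queue pipeline is omitted and placeholders are used just to keep the example short, and the shapes are taken from the description above (max_len is illustrative):

import tensorflow as tf  # TF 1.x

batch_size, max_len, embed_dim = 20, 50, 100
vocab_size, hidden_dim = 8000, 150

inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])  # GloVe vectors
labels = tf.placeholder(tf.int32, [batch_size, max_len])               # 0 = padding

cell = tf.nn.rnn_cell.LSTMCell(hidden_dim)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

logits = tf.layers.dense(outputs, vocab_size)
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# Zero out the loss on padded positions with tf.sign, then average over real tokens.
mask = tf.sign(tf.to_float(labels))
loss = tf.reduce_sum(losses * mask) / tf.reduce_sum(mask)

train_op = tf.train.AdamOptimizer().minimize(loss)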
I've noticed that the temperature rises a lot when I raise the mini-batch size. With size-1 mini-batches (single examples), it reads between 72 and 75 C. With size-10 mini-batches, it immediately goes to 78 C and stays in the range of 78-84 C. With size-20 mini-batches, 84-88 C. With size-30 mini-batches, 87-92 C.
If I raise the hidden dimension to 200, 250, 300, etc., while keeping the mini-batch size fixed, I also get similar temperature rises.
I've also trained the same model feeding the data with placeholders only, i.e., not using TFRecords, queues and mini-batches. It stays around 65 C, but it's obviously far from optimal to feed the net with placeholders.
I really appreciate your help, I'm kinda desperate, to be honest.
-----------------EDIT---------------------
It turns out the water cooler pump was configured in my BIOS to vary according to the CPU temperature... Obviously the GPU temperature wouldn't affect it, and that's what happened: it was running at 50% of its capacity. I've adjusted it to stay at 100% all the time, and now the same model runs with a max temperature of approx. 83 C. Still not perfect, but a huge improvement. I guess that with the complexity of my model plus the really high 1.8 GHz clock of my GPU there's not much I can do.
The maximum design temperature of the GTX 1080 according to Nvidia is 94 C. Anything below that and you should be safe.
Maximum GPU Temperature (in C): 94
The fact that the GPU temperature rises when you raise the mini-batch size is a good sign: it means that your GPU is working as hard as it can. In fact, if your GPU is not at ~80-90 C, it is not working at full power and you are losing some performance.
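If you want to keep an eye on the temperature while training, a simple sketch that polls nvidia-smi from Python (assumes nvidia-smi is on the PATH):

import subprocess
import time

while True:
    temp = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]
    ).decode().strip()
    print("GPU temperature: {} C".format(temp))
    time.sleep(30)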

Google TensorFlow based seq2seq model crashes while training

I have been trying to use Google's RNN based seq2seq model.
I have been training a model for text summarization, feeding in textual data of approximately 1GB. The model quickly fills up my entire RAM (8GB), starts filling up even the swap memory (a further 8GB) and then crashes, after which I have to do a hard shutdown.
The configuration of my LSTM network is as follows:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerDot
  attention.params:
    num_units: 128
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it still did not help. What specific changes should I make in order to prevent my model from taking up all of the RAM and crashing (e.g. decreasing the data size, decreasing the number of stacked LSTM cells, further decreasing the batch size, etc.)?
My system runs Python 2.7.x, TensorFlow 1.1.0 and CUDA 8.0. It has an Nvidia GeForce GTX 1050 Ti (768 CUDA cores) with 4GB of memory, 8GB of RAM and a further 8GB of swap memory.
Your model looks pretty small. The only thing that is at all big is the training data. Please check that your get_batch() function has no bugs; it is possible that each batch is actually loading the whole dataset.
To quickly test this, just cut your training data down to something very small (say, 1/10 of the current size) and see if that helps. Note that it should not help, because you are using mini-batches; but if it does resolve the problem, fix your get_batch() function, for example along the lines sketched below.
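A rough sketch of a memory-friendly get_batch(), i.e. one that only materializes the current mini-batch instead of copying the whole dataset (the names here are illustrative, not from the seq2seq codebase):

import random

def get_batch(examples, batch_size):
    """Yield shuffled mini-batches without ever copying the full dataset."""
    indices = list(range(len(examples)))
    random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

# usage: for batch in get_batch(train_examples, batch_size=16): ...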

How to solve not enough GPU memory when training a big model in TensorFlow?

I am running an LSTM demo in TensorFlow.
Cell output size: 461*461*4*120 = ~100MB (120 hidden nodes)
Softmax output size: 461*461*4*256 = ~200MB
But running this demo on an Nvidia 960 (4GB memory) exhausts all of the GPU memory. Why?
If the number of hidden nodes goes up to 1000, it is hard to make this work even with two GPUs (Nvidia 1080). How can I work this out?
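A back-of-the-envelope sketch of the activation sizes quoted above (4 bytes per float32, using the 461*461 factor given in the question). Note that these are single activations only; the framework also keeps gradients, optimizer state and the per-timestep intermediates needed for backpropagation, which is why a 4GB card can still run out of memory:

# Rough reproduction of the numbers in the question.
height, width, bytes_per_float = 461, 461, 4

cell_output_mb = height * width * bytes_per_float * 120 / 1e6     # ~102 MB (120 hidden nodes)
softmax_output_mb = height * width * bytes_per_float * 256 / 1e6  # ~218 MB (256 softmax outputs)

print(cell_output_mb, softmax_output_mb)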

GPU + CPU Tensorflow Training

Setup
I have a network, one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (at most 15000 vectors), which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store embed_mat on the GPU due to its size, I define it under the CPU device (say /cpu:0), but the rest of the parameters of the model, the optimizer etc. are defined under a GPU device (say /gpu:0), roughly as in the sketch below.
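A minimal sketch of the placement just described, assuming TF 1.x (shapes and names are illustrative):

import tensorflow as tf  # TF 1.x

with tf.device("/cpu:0"):
    # The large embedding matrix stays in host memory.
    embed_mat = tf.get_variable("embed_mat", shape=[3000000, 300])

with tf.device("/gpu:0"):
    ids = tf.placeholder(tf.int32, [None])             # up to ~15000 ids per mini-batch
    vectors = tf.nn.embedding_lookup(embed_mat, ids)   # gather the vectors needed this step
    # ... rest of the model, the loss and the Adam optimizer are built here ...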
Questions
I see that my GPU usage is minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup would be copied to the GPU and all my training would happen there. Am I doing something wrong?
The training time is heavily affected by the size (num_vectors) of the embedding matrix, which doesn't seem right to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if anything, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam Optimizer for this. Looks like because the embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
Try visualizing on TensorBoard where each of your ops is placed. In the "Graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup and the embedding gradient update should be on the CPU, while most other things should be on the GPU.
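For reference, one way to get the graph into TensorBoard so you can color it by device, assuming TF 1.x (the log directory is arbitrary):

import tensorflow as tf  # TF 1.x

# ... build the model graph first ...
with tf.Session() as sess:
    # Write the graph definition; point TensorBoard at ./logs and use the
    # Graph tab's color-by-device option to inspect op placement.
    writer = tf.summary.FileWriter("./logs", sess.graph)
    writer.close()

# Alternatively, log placements to stdout as ops are assigned:
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))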