Google TensorFlow based seq2seq model crashes while training - tensorflow

I have been trying to use Google's RNN based seq2seq model.
I have been training a model for text summarization and am feeding in approximately 1 GB of text data. The model quickly fills up my entire RAM (8 GB), starts filling up the swap memory as well (a further 8 GB), and then crashes, after which I have to do a hard shutdown.
The configuration of my LSTM network is as follows:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerDot
  attention.params:
    num_units: 128
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it still did not help. What specific changes should I make in order to prevent my model from taking up the entirety of RAM and crashing? (Like decreasing data size, decreasing number of stacked LSTM cells, further decreasing batch size etc)
My system runs Python 2.7.x, TensorFlow version 1.1.0, and CUDA 8.0. The system has an Nvidia GeForce GTX 1050Ti (768 CUDA cores) with 4GB of memory, and the system has 8GB of RAM and a further 8GB of swap memory.

Your model looks pretty small. The only thing that is somewhat big is the training data. Please check that your get_batch() function has no bugs. If there is a bug there, it is possible that you are actually loading the whole data set for every batch.
To test this quickly, cut your training data down to something very small (such as 1/10 of the current size) and see if that helps. Note that it should not help, because you are using mini-batches. But if it does resolve the problem, fix your get_batch() function.
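For comparison, a batch function that streams its input holds only one batch in memory at a time. The following is a minimal sketch, assuming the training data is a plain-text file with one example per line; the name iter_batches and the file layout are hypothetical and are not part of Google's seq2seq code.

```python
import itertools


def iter_batches(path, batch_size):
    """Yield lists of at most batch_size lines, reading the file lazily.

    Hypothetical helper: assumes one training example per line.
    """
    with open(path, "r") as handle:
        while True:
            # islice pulls only the next batch_size lines from the file
            # iterator, so the whole data set is never held in memory.
            batch = [line.rstrip("\n") for line in itertools.islice(handle, batch_size)]
            if not batch:
                return
            yield batch
```

A function that instead calls something like handle.readlines() on every invocation would reload the full file for each batch, which is the kind of bug described above.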

Related

How many training samples should I take for an object detection model with 62 classes?

I'm trying to train a YOLOv3 model for 62 classes using https://github.com/wizyoung/YOLOv3_TensorFlow.
How many samples should I take for each class?
I'm using a Nvidia GTX 1050Ti GPU so what should be my batch size with each image of 300*300 size?
Is 80-20 train/test split ideal?
Whether an 80-20 train/test (validation) split is appropriate depends on the number of samples, not on the number of classes. The more data you have, the more lopsided the split can be (with millions of samples, a 95%-5% split is workable).
As a rule of thumb, each class should have at least 200 bounding-box annotations.
The 1050Ti has only 4GB of VRAM. You can raise or lower the batch_size depending on your image size, but with so little VRAM available, a batch_size of 2 is most likely the maximum you can achieve for 300x300 images (decrease it to 1 if you hit OOM errors).
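The two rules of thumb above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names split_sizes and underrepresented_classes are hypothetical, and the label list is assumed to contain one class name per bounding box.

```python
from collections import Counter


def split_sizes(num_samples, val_fraction=0.2):
    """Return (train, val) counts for a given validation fraction."""
    num_val = int(round(num_samples * val_fraction))
    return num_samples - num_val, num_val


def underrepresented_classes(box_labels, min_boxes=200):
    """Return {class: count} for classes with fewer than min_boxes annotations.

    Classes with no annotations at all never appear in box_labels, so they
    have to be checked separately against the full class list.
    """
    counts = Counter(box_labels)
    return {label: n for label, n in counts.items() if n < min_boxes}
```

With 62 classes and 200 boxes per class, the minimum is 62 * 200 = 12,400 annotations, and split_sizes(12400) gives 9,920 for training and 2,480 for validation.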

tensorflow wide linear model inference on gpu slow

I am training a sparse logistic regression model on tensorflow. This problem is specifically about the inference part. I am trying to benchmark inference on cpu and gpu. I am using the Nvidia P100 gpu (4 dies) on my current GCE box. I am new to gpu so sorry for naive questions.
The model is pretty big ~54k operation (is it considered big compared to dnn or imagenet models ? ) . When i log device placement , i only see gpu:0 being used , and rest of them unused ? I don't do any device placement during training , but during inference i want it to optimally place and use gpu.
Few things i observed : my input node placeholder (feed_dict) is placed on cpu, so i assume my data is being copied from cpu to gpu ? how does feed_dict exactly work behind the scenes ?
1) How can i place my data on which i want to run prediction directly on gpu ? Note : my training runs on distributed cpu with multiple terabytes so i cannot have constant or variable directly in my graph during training , but my inference i can definitely have small batches of data that i would directly like to place on gpu. Are there ways i can achieve this ?
2) Since i am using P100 gpu , i think it has unified memory with host , is it possible to have zerocopy and directly have my data loaded into gpu ? How can i do this from python , java and c++ code. Currently i use feed_dict which from various google sources i think is not at all optimal .
3) Is there some tool or profiler i can use to see when i profile code like :
import time

for epoch_step in epochs:
    start_time = time.time()
    # Run the same batch epoch_step times and time the whole loop.
    for i in range(epoch_step):
        result = session.run(output, feed_dict={input_example: records_batch})
    end_time = time.time()
    print("Batch {} epochs {} :time {}".format(batch_size, epoch_step, str(end_time - start_time)))
how much time is being spent on 1) cpu to gpu data transfer 2) session run overhead 3) gpu utilization (currently i use nvidia-smi periodically to monitor)
4) kernel call overhead on cpu vs gpu (I assume each invocation of sess.run invokes 1 kernel call, right ?)
my current bench marking results :
CPU :
Batch size : 10
NumberEpochs TimeGPU TimeCPU
10 5.473 0.484
20 11.673 0.963
40 22.716 1.922
100 56.998 4.822
200 113.483 9.773
Batch size : 100
NumberEpochs TimeGPU TimeCPU
10 5.904 0.507
20 11.708 1.004
40 23.046 1.952
100 58.493 4.989
200 118.272 9.912
Batch size : 1000
NumberEpochs TimeGPU TimeCPU
10 5.986 0.653
20 12.020 1.261
40 23.887 2.530
100 59.598 6.312
200 118.561 12.518
Batch size : 10k
NumberEpochs TimeGPU TimeCPU
10 7.542 0.969
20 14.764 1.923
40 29.308 3.838
100 72.588 9.822
200 146.156 19.542
Batch size : 100k
NumberEpochs TimeGPU TimeCPU
10 11.285 9.613
20 22.680 18.652
40 44.065 35.727
100 112.604 86.960
200 225.377 174.652
Batch size : 200k
NumberEpochs TimeGPU TimeCPU
10 19.306 21.587
20 38.918 41.346
40 78.730 81.456
100 191.367 202.523
200 387.704 419.223
Some notable observations:
As batch size increases i see my gpu utilization increase (it reaches 100% for the only gpu it uses , is there a way i can tell tf to use the other gpus too ?)
batch size 200k is the only point where my naive benchmark shows the gpu with a minor gain over cpu.
Increasing batch size for a given number of epochs has minimal effect on time for both cpu and gpu up to batch size 10k. Beyond that, going from 10k -> 100k -> 200k, the time increases quite fast. For example, for epoch 10 with batch sizes 10, 100, 1k, 10k, the cpu and gpu times remain pretty stable: ~5-7 sec for gpu and 0.48-0.96 sec for cpu (meaning that sess.run has much higher overhead than computing the graph itself ?). But for larger batches the compute time increases at a much faster rate: for epoch 10, going from 100k -> 200k, the gpu time increases from 11 -> 19 sec and the cpu time also doubles , why so ? It seems that for larger batch sizes, even though i have just one sess.run, it is internally split into smaller batches and run twice, because epoch 20 batch size 100k matches more closely with epoch 10 batch size 200k.
How can i improve my inference further , i believe i am not using all gpus optimally.
Are there any ideas around how can i benchmark better to get better breakdowns of time for cpu-> gpu transfer and actual speedup for graph computation from moving from cpu to gpu ?
Loading data better directly if possible zero copy into gpu ?
Can i place some nodes to gpu only during inference to get better performance ?
Ideas around quantization or optimizing inference graph ?
Any more ideas to improve gpu based inference ? Maybe xla based optimization or TensorRT ? i want to have high performance inference code to run these computations on gpu while the application server crunches on cpu.
One source of information is the TensorFlow docs on performance, including Optimizing for GPU and High Performance Models.
That said, those guides tend to target training more than batch inference, though certainly some of the principles still apply.
I will note that, unless you are using DistributionStrategy, TensorFlow will not automatically put ops on more than a single GPU (source).
In your particular case, I don't believe GPUs are yet well-tuned to do the type of sparse operation required for your model, so I don't actually expect it to do that well on a GPU (if you log the device placement there's a chance the lookup is done on the CPU). A logistic regression model has only a (sparse) input layer and an output layer, so generally there are very few math ops. GPUs excel the most when they are doing lots of matrix multiplies, convolutions, etc.
Finally, I would encourage you to use TensorRT to optimize your graph, though for your particular model there's no guarantee it does much better.
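One way to read the benchmark table in the question is to convert each total into an average time per session.run call. The sketch below does that for the rows measured at 10 runs; the timings are copied from the question, while the function name per_call_seconds is only illustrative.

```python
# Timings (seconds) copied from the question's benchmark at 10 runs per
# measurement: batch size -> (GPU time, CPU time).
TIMINGS_10_RUNS = {
    10: (5.473, 0.484),
    100: (5.904, 0.507),
    1000: (5.986, 0.653),
    10000: (7.542, 0.969),
    100000: (11.285, 9.613),
    200000: (19.306, 21.587),
}
NUM_RUNS = 10


def per_call_seconds(timings, num_runs):
    """Average seconds per session.run call for each batch size."""
    return {
        batch: (gpu / num_runs, cpu / num_runs)
        for batch, (gpu, cpu) in timings.items()
    }


per_call = per_call_seconds(TIMINGS_10_RUNS, NUM_RUNS)
for batch in sorted(per_call):
    gpu, cpu = per_call[batch]
    print("batch {:>6}: gpu {:.3f} s/call, cpu {:.3f} s/call".format(batch, gpu, cpu))
```

The GPU figure stays at roughly 0.55 to 0.60 s per call from batch size 10 up to 1000, so it barely depends on the amount of data. That would be consistent with a fixed per-call cost dominating at small batches, although the table alone cannot say where that cost comes from.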

GPU temperature reads 88 C when training a LSTM on tensorflow

I've got a 1 layer LSTM model in tensorflow and the temperature reading of my GPU gets rather high during the training phase. Always varying between 80 C and 90 C. My GPU is a water cooled gtx 1080 "Super-clocked" edition in a 24/7 refrigerated room. The model works, but this temperature worries me. I'd like to know if this is normal and safe.
I'm training the LSTM for a next-word-prediction problem with tokenized reddit comments. I got the idea from different tutorials on wildml.com. Here are some details about it:
Tensorflow 1.2.1, Cuda tk 8.0, Cudnn 6.0, Nvidia Driver 375.66
My training data consists of 200 K reddit comments.
My word dictionary consists of 8000 words, which means 8000 classes of classification for each prediction
I use GLOVE pre-trained 100 Dimensions embeddings of Wikipedia words
I'm not using placeholders to feed my input. It's all done with TFRecord file readers, which feed the examples into a random shuffle queue with a capacity of 100k
From the random shuffle queue, it goes to a padding FIFO queue, where I generate zero-padded mini-batches of 20
The 20 size mini batches go to a tf.dynamic_rnn() with LSTM cell with Hidden dimension of 150
I mask the losses using tf.sign() and minimize the result with Adam optimizer
I've noticed that the temperature rises a lot when I raise the mini-batch size. 1 size mini-batches (single examples), it reads between 72-75 C. With 10 size mini-batches, it immediately goes to 78 C and stays in the range of 78-84 C. With 20 size mini-batches, 84-88 C. With 30 size mini-batches, 87-92 C.
If I raise the hidden dimension to 200, 250, 300, etc., while keeping the mini-batch size fixed, I also get similar temperature rises.
I've also trained the same model, but feeding the data with placeholders only, i.e, not using TFRecord, Queues and mini-batches. It stays around 65 C, but it's obviously far from optimized and ideal to use placeholders for feeding the net.
I really appreciate your help, I'm kinda desperate, to be honest.
-----------------EDIT---------------------
It turns out the water cooler pump was configured in my BIOS to vary with the CPU temperature. Obviously the GPU temperature wouldn't affect it, and that's what happened: it was running at 50% of its capacity. I've adjusted it to stay at 100% all the time, and now the same model runs with a max temp of approx. 83 C. Still not perfect, but a huge improvement. I guess that with the complexity of my model + the really high 1.8 GHz clock of my GPU there's not much I can do.
The maximum design temperature of the GTX 1080 according to nvidia is 94 C. Anything below that and you should be safe.
Maximum GPU Temperature (in C) 94
The fact that the GPU temperature rises when you raise the mini-batch size is a good sign: it means that your GPU is working as hard as it can. In fact, if your GPU is not at ~80-90 C, it is not working at full power, and you are losing some performance.
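To keep an eye on the margin below that limit during training, the temperature can be read from nvidia-smi. This is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names are hypothetical, and the 94 C limit is the figure quoted above.

```python
import subprocess

MAX_GPU_TEMP_C = 94  # GTX 1080 limit quoted in the answer above


def parse_temperatures(smi_output):
    """Parse one integer temperature per line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_gpu_temperatures():
    """Query nvidia-smi for the current temperature of each GPU (in C)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]
    )
    return parse_temperatures(output.decode("utf-8"))


def headroom(temperatures, limit=MAX_GPU_TEMP_C):
    """Degrees remaining below the limit for each GPU."""
    return [limit - t for t in temperatures]
```

For the 88 C reading in the question, headroom([88]) returns [6], i.e. 6 C below the quoted maximum.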

Tensorflow batching is very slow

I tried to set up a very simple MNIST example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slow and only utilizes about 30 % of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and to increase the number of threads used for batching. However, this only helped to get to about 45% of GPU utilization and resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM

GPU + CPU Tensorflow Training

Setup
I have a network, one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (max 15000 vectors) which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store this embed_mat on the GPU, due to its size, I define it under a CPU device (say /cpu:0), but the rest of the parameters of the model, the optimizer etc. are defined under a GPU device (say /gpu:0).
Questions
I see that my GPU usage is very minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup is copied to the GPU and all my training happens there. Am I doing something wrong?
The training time is very largely affected by the size (num_vectors) of the embedding matrix which doesn't seem correct to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if at all, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam Optimizer for this. Looks like because the embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.
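For a sense of scale, the memory involved can be estimated with simple arithmetic. The sketch below assumes float32 storage, which the question does not state, and relies on the fact that Adam keeps two extra per-parameter accumulators (first and second moment estimates); the function name embedding_bytes is only illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: the question does not state the dtype


def embedding_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of a dense num_vectors x dim matrix."""
    return num_vectors * dim * bytes_per_value


matrix = embedding_bytes(3000000, 300)
# Adam keeps two extra accumulators per parameter, each with the same
# shape as the variable, so the total is three times the matrix size.
with_adam_slots = 3 * matrix
# Rows touched by a single mini-batch (at most 15000 per the question).
looked_up = embedding_bytes(15000, 300)

print("embed_mat: {:.1f} GB".format(matrix / 1e9))
print("embed_mat + Adam slots: {:.1f} GB".format(with_adam_slots / 1e9))
print("looked-up rows per batch: {:.0f} MB".format(looked_up / 1e6))
```

Under these assumptions the matrix alone is about 3.6 GB and about 10.8 GB with Adam's accumulators, while the rows looked up in one mini-batch come to only about 18 MB. Whether the optimizer touches the full accumulators on every step depends on its implementation, which may be worth checking when training time grows with the matrix size.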