Log device info in DNNClassifier estimator in Tensorflow

I am using the DNNClassifier Estimator to train a binary classifier. I want to log device info to verify whether my model is running on the GPU or the CPU.
Since we don't deal with a session when using an Estimator, how can I log device info?
Major problem: my 3-layer neural net with hidden units [100, 75, 50] runs faster on the CPU than on the GPU. I tried increasing the batch size up to 256, but the result is the same. Hence, I want to confirm whether it is actually using the GPU.

Use the config argument of tf.estimator.Estimator.__init__:
classifier = DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[100, 75, 50],
    config=tf.estimator.RunConfig(
        session_config=tf.ConfigProto(log_device_placement=True)))
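If you just want to confirm that TensorFlow sees the GPU at all, independent of the Estimator, a quick sanity check (a minimal TF 1.x sketch, not from the original answer) is:

from tensorflow.python.client import device_lib

# Lists the devices TensorFlow can place ops on; a usable GPU shows up
# as something like "/device:GPU:0" together with its memory limit.
print(device_lib.list_local_devices())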

Related

Tensorflow not using multiple GPUs - getting OOM

I'm running into OOM on a multi-GPU machine because TF 2.3 seems to be allocating a tensor using only one GPU.
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But tensorflow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?
The direct answer is yes, you do need to do more to get TF to use multiple GPUs. You should refer to this guide, but the TL;DR is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
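Fleshed out for a tf.keras model, the pattern from that guide looks roughly like this (a minimal sketch; the model architecture and the dummy data are placeholders, not taken from the question):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Build and compile the model inside the strategy scope so its variables
# are mirrored across all visible GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(64, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# fit() then splits each global batch across the replicas.
x = np.random.rand(256, 64, 128, 3).astype('float32')   # dummy data
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, batch_size=64, epochs=1)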
But in your case, something else is happening. While this one tensor may be what finally triggers the OOM, it's likely that several previously allocated large tensors had already filled the memory.
The first dimension, your batch size, is 20532, which is really big. Since that factors as 2**2 × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3x64x128, which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print model.summary() and then review the shapes of the tensors coming out of each layer. You may also need to look at how you are batching.
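For reference, inspecting the per-layer output shapes is a one-liner in tf.keras (a small sketch; `model` is a hypothetical name standing in for the model from your training script):

# Prints every layer with its output shape and parameter count.
model.summary()

# Or walk the layers yourself to spot where a dimension blows up.
for layer in model.layers:
    print(layer.name, layer.output_shape)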

Identify memory leak in tensorflow data pipeline and training?

Memory leaks with a tf.data.Dataset pipeline. Is there a profiler to identify the memory leaks in the pipeline or in tf.keras training?
A few questions, if you have any thoughts:
1. Is there an obvious problem in the pseudocode that I am overlooking?
2. Any thoughts on where/what to look for?
3. Any pointers on how to profile RAM usage as training goes on, to pinpoint the problem?
I just moved my codebase to eager mode under TensorFlow 1.15 and I am running into memory issues that I didn't have before. Before moving to eager mode, I could train for 500+ epochs without any issues; now, training stops after 70 epochs. I am trying to figure out a way to identify where the leak is, and I was hoping some of you have ideas.
I am using tf.data.Dataset to build the data pipeline (see pseudocode below) and, to speed up data feeding, I am using datasets with interleave as shown below. I have preprocessed data stored in sharded TFRecord files, and the dataset API loads the data and does minimal processing to supply appropriately batch-sized data. GPU memory seems fine, and training goes on until the CPU RAM is completely depleted. As you can see in the table below, the psutil memory log shows a continuous increase in CPU RAM.
What I have tried:
Explicitly calling gc.collect() and tf.set_random_seed(1) as suggested in the links below, but neither seems to help:
https://github.com/tensorflow/tensorflow/issues/30324
Memory Continually Increasing When Iterating in Tensorflow Eager Execution
Ubuntu 18.04, tf-nightly-gpu 1.15.0.dev20190728
CPU - AMD Ryzen Threadripper 1920X 12-Core Processor
RAM - 128GB
GPU - RTX 2080 Ti 11GB
# Generator that is passed to fit_generator
def get_simple_Dataset_generator(…):
    dataset = load_dataset(…)
    while True:
        try:
            for x, Y in dataset:
                yield x, Y
        finally:
            dataset = load_dataset("change data sources")
            # tried gc.collect(), tf.set_random_seed(1)
# Sets up the dataset with interleave.
def load_dataset(…):
    # setup etc.
    dataset = dataset.interleave(
        lambda x: tf.data.Dataset.from_generator(
            self.simple_gen_step1,
            output_types=(tf.string, tf.float32, tf.string),
            args=(x, batch_size, lstm_reshape)),
        cycle_length=2,
        block_length=1)
    dataset = dataset.interleave(
        lambda each_ticker, each_dataset, each_dates: tf.data.Dataset.from_generator(
            self.simple_gen_step2,
            output_types=(tf.float32, tf.int16),
            args=(names, dataset, dates, batch_size)),
        cycle_length=2,
        block_length=1)
    return dataset
# Our model uses CuDNNLSTM and Dense layers
def build_model():
    model = Sequential()
    model.add(CuDNNLSTM(feature_count,
                        batch_input_shape=(batch_size, look_back, feature_count),
                        stateful=Settings.get_config(Settings.STATEFUL),
                        return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 1), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 2), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 3), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(target_classes, activation='softmax'))
    return model
[psutil log showing CPU RAM usage steadily increasing during training]
For anyone running into a similar issue: I think there is a memory leak when class_weights are used in fit_generator. I have posted another question with more details:
Using class_weights in fit_generator causes memory leak
I thought I'd share what I have found regarding memory leakage in TensorFlow 2.x. It might not be a 100% specific answer to your concrete questions, but it might help others solve their memory leakage issues when using built-in functions like model.fit().
Here is a link to one of the related GitHub issues, and here is my solution (please also consider the comments on my solution).
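On question 3 (profiling RAM as training goes on): one low-tech option, not from the original answers, is to log the process's resident memory at the end of each epoch with psutil and a small tf.keras callback. A minimal sketch:

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Prints the training process's resident memory (RSS) after every epoch."""
    def __init__(self):
        super().__init__()
        self._process = psutil.Process(os.getpid())

    def on_epoch_end(self, epoch, logs=None):
        rss_gb = self._process.memory_info().rss / 1024 ** 3
        print("epoch %d: process RSS = %.2f GB" % (epoch, rss_gb))

# Usage: pass it alongside your other callbacks, e.g.
# model.fit_generator(generator, steps_per_epoch=..., epochs=..., callbacks=[MemoryLogger()])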

OOM with tensorflow

I'm facing an OOM error while training my TensorFlow model; the structure is as follows:
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
I tried to reduce the batch size down to as low as 64, my input data is padded to 1500, and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en). I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the Estimator API with MirroredStrategy and still no luck. Is there maybe a way to ask TensorFlow to just run the training on the GPUs but keep the tensors stored in the main machine's memory? Any other suggestions are welcome.
Running a particular operation (e.g. some tensor multiplication during training) on the GPU requires those tensors to be stored on the GPU.
You might want to use TensorBoard or something like that to see which operations require the most memory in your computation graph. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you'd need to narrow that down somehow.
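One concrete way to get per-op timing and memory numbers with the TF 1.x Estimator API is tf.train.ProfilerHook, which writes Chrome-trace timelines you can open in chrome://tracing. A minimal sketch, not from the original answer; `estimator`, `train_input_fn` and the output path are placeholders for your own objects:

import tensorflow as tf

# Dump a timeline every 100 steps, including memory usage per op.
profiler_hook = tf.train.ProfilerHook(
    save_steps=100,
    output_dir='/tmp/tf_profile',   # hypothetical path
    show_memory=True)

# estimator.train(input_fn=train_input_fn, hooks=[profiler_hook])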

OOM error when training TensorFlow object-detection API using inception-resnet & NASnet (especially) as backbone

Please help me find a solution to my problem. It's important to state first that I have successfully created my own custom dataset and successfully trained it using resnet101 on my own computer (16GB RAM and a 4GB NVIDIA 980).
The problem arises when I try to switch the backbone to inception-resnet or nasnet. I get the following error:
"ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape ..."
I thought I didn't have enough resources on my computer, so I created an instance on AWS EC2 with 60GB RAM and a 12GB NVIDIA Tesla K80 (my workplace only provides this service) and trained the network there.
Training with inception-resnet worked well; however, that's not the case with nasnet. Even with 100GB of memory I still get the OOM error.
I found one solution on the GitHub tensorflow/models page, at issue #1817, and I followed the instructions by adding the following lines to the nasnet config file:
train_config: {
  batch_size: 1
  batch_queue_capacity: 50
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
  ...
and the code ran well for a while. However, I still got the OOM error after running around 6000 steps:
INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB. Current allocation summary follows.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
[[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1],
...
Is there anything else I could do to run this smoothly without any OOM errors? Thanks for your help.
EDIT #1: The errors come more frequently now; they show up after 1000-1500 steps.
EDIT #2: Based on issue #2668 and issue #3014, there's one more thing we can do to run the code without OOM errors: add second_stage_batch_size: 25 (the default is 50) in the model section of the config file. The file should then look like the following:
model {
  faster_rcnn {
    ...
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 25
  }
}
Hope this can help.
I would like to point out that the memory you are running out of is GPU memory, so I'm afraid those 100GB are only useful for data wrangling outside of training. Also, without code it's really difficult to figure out where the error is coming from.
That being said, if you can initialize the network with weights and train for 6000 iterations before suddenly running out of GPU memory, then I'd guess you are either somehow accumulating values in GPU memory or, if you have variable-length inputs, you might be passing a sequence in that iteration that is too big memory-wise.

Tensorflow batching is very slow

I tried to set up a very simple MNIST example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter of the fit() function. To build the input_fn, I used tf.slice_input_producer and batched it with tf.train.batch. This is my code: https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (shown in nvidia-smi).
I also tried increasing the queue capacity of the slice_input_producer and the number of threads used for batching. However, this only got me to about 45% GPU utilization, and CPU utilization went to 100%.
What am I doing wrong? Is there a better way to feed and batch the inputs? I do not want to create the batches manually (creating subarrays of the numpy input array), because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM
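As an aside (not part of the original thread): the queue-based slice_input_producer/tf.train.batch pipeline is usually the bottleneck in setups like this; the tf.data API with prefetching is the standard replacement and typically keeps the GPU much busier. A minimal sketch of such an input_fn, assuming a newer TF 1.x release (tf.data is not available in 1.0) and NumPy arrays train_images/train_labels as placeholders:

import tensorflow as tf

def train_input_fn(train_images, train_labels, batch_size=128):
    # Build a dataset from in-memory arrays, shuffle, batch, and prefetch
    # so the next batch is prepared while the GPU works on the current one.
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    dataset = dataset.shuffle(buffer_size=10000).repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(1)
    return dataset

# estimator.train(input_fn=lambda: train_input_fn(train_images, train_labels), steps=10000)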