Tensorflow model.fit crashed in while loop - tensorflow

I am trying to optimise the learning_rate parameter of my ML model with a loop.
The first model completes all of its training steps; however, in the second iteration of the loop, and thus the second call of model.fit(), training fails already in the first epoch. No output is generated.
Edit:
I have traced the problem to the TensorBoard callback. Without that callback the loop successfully trains all 4 models, while with the callback the loop fails at the beginning of the second iteration/model fit. What am I doing wrong here?
for lr in [0.005, 0.001, 0.0005, ...]:
    tf.keras.backend.clear_session()
    ...
    ## create model
    model = createModel(...)
    model.compile(learning_rate = lr)
    tensorboard_callback = tf.keras.callbacks.TensorBoard(...)
    model.fit(..., callbacks = [tensorboard_callback])
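For reference, a self-contained sketch of the same loop pattern under TF 2.4 (the dataset, model, Adam optimizer and per-run log directories below are illustrative placeholders, not my actual code):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

for lr in [0.005, 0.001, 0.0005, 0.0001]:
    # drop graph state left over from the previous run
    tf.keras.backend.clear_session()

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # one log directory per run so TensorBoard keeps the runs separate
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir="logs/lr_{}".format(lr))

    model.fit(x_train, y_train, epochs=2,
              callbacks=[tensorboard_callback], verbose=0)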
Does anybody know what I am doing wrong or why this does not work?
Thank you very much!
Environment:
I am working on a Debian server with two Nvidia Tesla V100S (32GB) cards (only one is used for training the model), 128 CPU cores and 2TB main memory
Python: 3.7.9
Tensorflow: 2.4.1
The implementation is inside a Jupyter notebook

The problem has been solved: an incorrect version of cuDNN was installed. Thanks!

Related

Program crashed in the last test step with Tensorflow-gpu 2.0.0

I am using Tensorflow 2.0.0 and have split the dataset into a train set and a test set. The training and testing code is as follows:
for epoch in range(params.num_epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dist_dataset):
        # DO TRAINING HERE ...
    if epoch % params.num_epoch_record == 0:
        for step, (x_test, y_test) in enumerate(test_dist_dataset):
            # DO TESTING HERE ...
        checkpoint.step.assign_add(1)
        save_path = manager.save()
        logger.info("Saved checkpoint {}".format(save_path))
However, after the last batch of test data in enumerate(test_dist_dataset), the program crashes and shows:
F .\tensorflow/core/kernels/conv_2d_gpu.h:964] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument
So why does this occur, and how can it be solved?
In my case, the problem was related to the batch size. I am using the NVIDIA Docker 19.12 image and a data generator. The code works well on one GPU; the problem happened only with MirroredStrategy in model.predict.
The error happens when the total number of samples is not evenly divisible by batch_size. For example, if you have 5 samples and a batch_size of 2, the third batch contains only one sample and causes the problem.
The solution is either to throw away the incomplete last batch or, as in my case, to add some dummy samples to fill up the last batch.
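A minimal sketch of the padding workaround (the helper name pad_to_batch_multiple and the repeat-last-row strategy are illustrative, not from my actual code):

import numpy as np

def pad_to_batch_multiple(x, batch_size):
    """Pad x with copies of its last row so len(x) is a multiple of batch_size."""
    remainder = len(x) % batch_size
    if remainder == 0:
        return x, 0
    n_pad = batch_size - remainder
    dummy = np.repeat(x[-1:], n_pad, axis=0)  # dummy rows that fill the last batch
    return np.concatenate([x, dummy], axis=0), n_pad

# 5 samples with batch_size 2 -> one dummy row appended, giving 3 full batches.
x = np.random.rand(5, 10).astype("float32")
x_padded, n_pad = pad_to_batch_multiple(x, batch_size=2)

After model.predict(x_padded, batch_size=2), simply drop the last n_pad predictions.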

Running Tensorflow model inference script on multiple GPUs

I'm trying to run the model scoring (inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in the main function, but it runs only on a single GPU (GPU utilization snapshot placed here).
I am using tensorflow-gpu==1.13.1; can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()
        with detection_graph.as_default():
            with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
                # call to run_inference_multiple_images function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So if you haven't already tried, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest approach is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
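A minimal sketch of the environment-variable approach (assuming the two GPUs have indices 0 and 1; the variable has to be set before TensorFlow initializes CUDA, so set it before the import):

import os
# Make both GPUs visible before TensorFlow touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf

# With no explicit tf.device() placement, TF 1.x registers every visible GPU
# and can place ops on either of them.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.constant("both GPUs visible")))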

Tensorflow-loss not decreasing when training

I am using the TensorFlow Object Detection API for my own dataset and I am facing some problems. I am using CentOS, with a GeForce 1080 GPU (8 GB GPU memory) and TensorFlow 1.2.1. I have 500 images in the training set and 40 in the test set. I did the following steps and I have two problems.
1. I annotated my images using the LabelImg tool
2. Created the tfrecord files successfully
3. I used ssd_inception_v2_coco.config. I modified only the paths and the number of classes, and I did not train from scratch; I used the ssd_inception_v2_coco model checkpoints.
Problem 1: From step 0 until 3000 my loss decreased dramatically, but after that it stays constant between 5 and 6. I don't know how to reduce it further, yet my model is still able to detect the required object. Here are my Tensorboard samples.
I even tried a different model, e.g. faster_rcnn_inception_resnet_v2_atrous_coco; after some steps the loss stays constant between 1 and 2.
Problem 2: According to the documentation I am able to run eval.py, but I get the following error:
WARNING:root:The following classes have no ground truth examples: 0
After that the program terminates. I tried to run train.py and eval.py at the same time and still get the same error.
Please give me a suggestion. I am a TensorFlow beginner and need suggestions.
The loss curve you're seeing on Tensorboard is quite normal. Initially, the loss will drop very quickly, but will seemingly "bottom out" over time. Training is a slow process; you should see a steady drop over time after more iterations.

Tensorboard doesn't show runtime/memory for all operations

I have a network implemented in TensorFlow that takes a very long time to train, and I therefore want to profile it to see which parts cause the long runtime.
To do that, I follow the instructions here to capture runtime and memory information. My code looks like this:
# define network
loss = ...
train_op = tf.train.AdamOptimizer().minimize(loss, global_step=global_step)

# run forward and backward prop for one batch
run_metadata = tf.RunMetadata()
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
_, loss, sum = sess.run([train_op, loss, sum], feed_dict=fd, options=options, run_metadata=run_metadata)
writer.add_run_metadata(run_metadata, 'step_%d' % step)
I can then see "session runs" in TensorBoard. However, as soon as I load a session run, most operations in my graph turn orange as shown below and no runtime or memory information is available for them:
According to the legend, these operations are "unused". But that cannot be the case, as almost everything except "loss" and "opt" is shown like that. Clearly, the whole network has to be used to compute the loss, so I don't really see why the graph is shown like this.
I use TF 1.3 on a Tesla K40c.
I used to have the same problem as you with Tensorboard not registering anything in my session run except the gradient and optimizer ops.
I fixed it by upgrading my version of Tensorflow to the 1.4 release.
I'm not sure, but try adding this line
writer.add_summary(sum, step)
after
writer.add_run_metadata...
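A self-contained TF 1.x sketch of how the run metadata and the summary for the same step can be written together (the toy graph, log directory and variable names are illustrative only):

import numpy as np
import tensorflow as tf

# toy graph: a single dense layer with a scalar loss summary
x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.AdamOptimizer().minimize(loss)
merged = tf.summary.merge([tf.summary.scalar("loss", loss)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter("./logs", sess.graph)
    fd = {x: np.random.rand(8, 4).astype(np.float32),
          y: np.random.rand(8, 1).astype(np.float32)}
    for step in range(3):
        run_metadata = tf.RunMetadata()
        options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        _, summary_value = sess.run([train_op, merged], feed_dict=fd,
                                    options=options, run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata, 'step_%d' % step)
        writer.add_summary(summary_value, step)  # summary for the same step
    writer.close()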

Cannot run Tensorflow code multiple times in Jupyter Notebook

I'm struggling to run TensorFlow (v1.1) code multiple times in a Jupyter Notebook.
For example, I execute this simple code snippet that creates an encoding layer for a seq2seq model:
# Construct encoder layer (LSTM)
encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)
The first time it runs totally fine and my encoder is created.
However, if I rerun it (no matter what changes I've applied), I get this error:
Attempt to have a second RNNCell use the weights of a variable scope that already has weights
It's very annoying, as it forces me to restart the kernel every time I want to change a layer.
Can someone explain to me why this happens and how I can fix it?
Thanks!
You are trying to build the exact same graph twice and therefore TensorFlow complains because the variables already exist in the default graph.
What you could do is call tf.reset_default_graph() before trying to call the method a second time, to ensure you create a new graph when required.
Just in case, I would also suggest using an interactive session as described here in the Start TensorFlow InteractiveSession section:
import tensorflow as tf
sess = tf.InteractiveSession()
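A minimal sketch of that pattern applied to the encoder cell above (the hidden size and the shape of the input placeholder are placeholder values, not from the original question):

import tensorflow as tf

# Start from a clean graph so the LSTM variables created by the previous
# run no longer exist when the cell is rebuilt.
tf.reset_default_graph()
sess = tf.InteractiveSession()

encoder_hidden_units = 128  # placeholder value
encoder_inputs_embedded = tf.placeholder(
    tf.float32, [None, None, 64])  # [batch, time, embedding], placeholder shape

encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)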