TensorFlow with multi-GPU and tf.RandomShuffleQueue

I am trying to modify the Mask R-CNN code to run on multiple GPUs, based on the CIFAR-10 sample. Most of the code is below.
One image and its ground-truth information are read from a TFRecords file as below
image, ih, iw, gt_boxes, gt_masks, num_instances, img_id = \
The image size and num_instances differ between images. These inputs are then stored in a RandomShuffleQueue as below
data_queue = tf.RandomShuffleQueue(capacity=32, min_after_dequeue=16,
                                   dtypes=(
                                       image.dtype, ih.dtype, iw.dtype,
                                       gt_boxes.dtype, gt_masks.dtype,
                                       num_instances.dtype, img_id.dtype))
enqueue_op = data_queue.enqueue((image, ih, iw, gt_boxes, gt_masks, num_instances, img_id))
data_queue_runner = tf.train.QueueRunner(data_queue, [enqueue_op] * 4)
tf.add_to_collection(tf.GraphKeys.QUEUE_RUNNERS, data_queue_runner)
Then I use tower_grads to gather the gradients from each GPU and average them. Below is the multi-GPU code:
tower_grads = []
num_gpus = 2
with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                (image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) = data_queue.dequeue()
                im_shape = tf.shape(image)
                image = tf.reshape(image, (im_shape[0], im_shape[1], im_shape[2], 3))
                total_loss = compute_loss()  # use tensor from dequeue operation to compute loss
                grads = compute_grads(total_loss)
grads = average_grads(tower_grads)
When num_gpus=1, the code works well (I mean there is no error), but when I use two TITAN X GPUs, strange errors like the following occur:
failed to enqueue async memset operation: CUDA_ERROR_INVALID_HANDLE
Internal: Blas GEMM launch failed
The error is not the same from run to run. I can't figure out why these errors occur with multiple GPUs. Is there some conflict on the data queue or between the GPUs?
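The average_grads helper called in the question is not shown. As a rough sketch of what such a step usually computes in the CIFAR-10 multi-GPU pattern, the following uses plain Python lists in place of tensors. The function name average_tower_grads and the (gradient, variable_name) pair layout are assumptions made for illustration, not the asker's code.

```python
def average_tower_grads(tower_grads):
    """Average gradients variable-by-variable across towers.

    tower_grads: one entry per GPU tower; each entry is a list of
    (gradient, variable_name) pairs in the same variable order.
    Gradients are plain lists of floats here (stand-ins for tensors).
    """
    if not tower_grads:
        raise ValueError("tower_grads is empty; append each tower's gradients first")
    averaged = []
    # zip(*tower_grads) groups the pairs that belong to the same variable.
    for pairs in zip(*tower_grads):
        grads = [g for g, _ in pairs]
        name = pairs[0][1]
        # Element-wise mean over the towers.
        mean = [sum(vals) / len(vals) for vals in zip(*grads)]
        averaged.append((mean, name))
    return averaged


# Hypothetical gradients from two towers for variables "w" and "b".
tower_0 = [([1.0, 2.0], "w"), ([0.5], "b")]
tower_1 = [([3.0, 4.0], "w"), ([1.5], "b")]
result = average_tower_grads([tower_0, tower_1])
```

In the snippet as posted, nothing is appended to tower_grads inside the loop, so an averaging step like this one would receive an empty list unless each tower's grads is collected first.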


How do I use all cores of my CPU in reinforcement learning with TF Agents?

I work with an RL algorithm. I'm using tensorflow and tf-agents and training a DQN. My problem is that only one core of the CPU is used when calculating the 10 episodes in the environment for data collection.
My training function looks like this:
def train_step(self, n_steps):
    env_steps = tf_metrics.EnvironmentSteps()
    #num_episodes = tf_metrics.NumberOfEpisodes()
    rew = TFSumOfRewards()
    action_hist = tf_metrics.ChosenActionHistogram(
        name='ChosenActionHistogram', dtype=tf.int32, buffer_size=1000
    )
    # Add the replay buffer and metrics to the observers
    replay_observer = [self.replay_buffer.add_batch]
    train_metrics = [env_steps, rew]
    driver = dynamic_episode_driver.DynamicEpisodeDriver(
        self.train_env, self.collect_policy, observers=replay_observer + train_metrics, num_episodes=self.collect_episodes)
    final_time_step, policy_state = driver.run()
    print('Number of Steps: ', env_steps.result().numpy())
    for train_metric in train_metrics:
        train_metric.tf_summaries(train_step=self.global_step, step_metrics=train_metrics)
    # Convert the replay buffer to a tf.data.Dataset
    # Dataset generates trajectories with shape [Bx2x...]
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    dataset = self.replay_buffer.as_dataset(
        num_steps=(self.train_sequence_length + 1)).prefetch(AUTOTUNE)
    iterator = iter(dataset)
    train_loss = None
    for _ in range(n_steps):
        # Sample a batch of data from the buffer and update the agent's network.
        experience, unused_info = next(iterator)
        train_loss = self.agent.train(experience)

def train_agent(self, n_epoch):
    for i in range(n_epoch):
        if(self.IsAutoStoreCheckpoint == True):
As already written above, num_episodes = 10, so it would make sense to run the 10 episodes in parallel before the network is trained.
If I set num_parallel_calls to e.g. 10, nothing changes. What do I have to do to use all cores of my CPU (Ryzen 9 5950X with 16 cores)?
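TF-Agents ships tf_agents.environments.parallel_py_environment.ParallelPyEnvironment, which steps several Python environments in separate processes. Independently of that API, the underlying idea of collecting the 10 episodes concurrently can be sketched with the standard library alone. Everything below (run_episode, collect_episodes, the toy reward loop) is hypothetical and stands in for a real environment and policy.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def run_episode(seed, max_steps=100):
    """Play one toy episode and return its total reward (stand-in for a real env)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(max_steps):
        total += rng.random()  # pretend reward for one step
    return total


def collect_episodes(num_episodes, executor_cls=ProcessPoolExecutor, workers=None):
    """Run num_episodes independent episodes, one task per episode."""
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so the i-th return belongs to seed i.
        return list(pool.map(run_episode, range(num_episodes)))
```

With the default process pool, collect_episodes(10) should be called from code guarded by if __name__ == "__main__": so that worker processes can import the module safely. A speed-up can only be expected when stepping the environment is CPU-bound Python work, since each episode then runs on its own core.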

Tensorflow, Keras: In a multi-class classification, accuracy is high, but precision, recall, and f1-score is zero for most classes

General Explanation:
My code runs fine, but the results are weird. I don't know whether the problem is with
the network structure,
or the way I feed the data to the network,
or anything else.
I have been struggling with this problem for several weeks. So far I have changed the loss function, optimizer, data generator, etc., but I could not solve it. I appreciate any help.
If the following information is not enough, please let me know.
Field of study:
I am using TensorFlow and Keras for multi-label classification. The dataset has 36 binary human attributes. I have used ResNet50, and for each part of the body (head, upper body, lower body, shoes, accessories) I have added a separate branch to the network. The network has 1 input image with 36 labels and 36 output nodes (36 Dense layers with sigmoid activation).
The problem is that the accuracy Keras reports is high, but the F1 score is very low or zero for most of the outputs (even when I use the F1 score as a metric when compiling the network, the validation F1 score is very bad).
After training, when I use the network in prediction mode, it always returns one or always returns zero for some classes. This suggests that the network is not learning (even when I use a weighted loss function or a focal loss function).
Why is it weird? Because state-of-the-art methods report a high F1 score even after the first epoch (e.g. https://github.com/chufengt/iccv19_attribute, which I have run on my PC and got good results after one epoch).
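One way high accuracy and zero F1 can coexist is class imbalance: if an attribute is rare and the network always predicts 0 for it, almost every prediction counts as correct while no positive is ever found. The numbers below are made up for illustration and are not taken from the dataset in question; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical attribute that is present in 5 of 100 images.
y_true = [1] * 5 + [0] * 95
# A network that has collapsed to "always absent".
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)  # 0.95
f1 = f1_score(y_true, y_pred)   # 0.0
```

Checking the per-attribute positive rate in the training labels against the per-attribute predictions is a quick way to see whether this is what is happening.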
Parts of the Codes:
print("setup model ...")
input_image = KL.Input(args.img_input_shape, name= "input_1")
C1, C2, C3, C4, C5 = resnet_graph(input_image, architecture="resnet50", stage5=False, train_bn=True)
output_layers = merged_model (input_features=C4)
model = Model(inputs=input_image, outputs=output_layers, name='SoftBiometrics_Model')
print("model compiling ...")
OPTIM = optimizers.Adadelta(lr=args.learning_rate, rho=0.95)
model.compile(optimizer=OPTIM, loss=binary_focal_loss(alpha=.25, gamma=2), metrics=['acc',get_f1])
plot_model(model, to_file='model.png')
img_datagen = ImageDataGenerator(rotation_range=6, width_shift_range=0.03, height_shift_range=0.03, brightness_range=[0.85,1.15], shear_range=0.06, zoom_range=0.09, horizontal_flip=True, preprocessing_function=preprocess_input_resnet, rescale=1/255.)
img_datagen_test = ImageDataGenerator(preprocessing_function=preprocess_input_resnet, rescale=1/255.)
def multiple_outputs(generator, dataframe, batch_size, x_col):
    Gen = generator.flow_from_dataframe(dataframe=dataframe,
                                        x_col = x_col,
                                        y_col = args.Categories,
                                        target_size = (args.img_input_shape[0],args.img_input_shape[1]),
                                        class_mode = "multi_output",
                                        batch_size = batch_size,
                                        shuffle = True)
    while True:
        gnext = Gen.next()
        # return image batch and 36 sets of labels
        labels = gnext[1]
        output_dict = {"{}_output".format(Category): np.array(labels[index]) for index, Category in enumerate(args.Categories)}
        yield {'input_1':gnext[0]}, output_dict
trainGen = multiple_outputs (generator = img_datagen, dataframe=Train_df_img, batch_size=args.BATCH_SIZE, x_col="Train_Filenames")
testGen = multiple_outputs (generator = img_datagen_test, dataframe=Test_df_img, batch_size=args.BATCH_SIZE, x_col="Test_Filenames")
STEP_SIZE_TRAIN = len(Train_df_img["Train_Filenames"]) // args.BATCH_SIZE
STEP_SIZE_VALID = len(Test_df_img["Test_Filenames"]) // args.BATCH_SIZE
print("Fitting the model to the data ...")
history = model.fit_generator(generator=trainGen,
callbacks= [chekpont],
It is possible that you are passing a binary F1 score to the compile function. This should fix the problem:
pip install tensorflow-addons
import tensorflow_addons as tfa
f1 = tfa.metrics.F1Score(num_classes=36, average='micro')  # or average='macro'
You can read more about how micro- and macro-averaged F1 are calculated and which one is more useful for your case.
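The difference between the two averaging modes can be shown without any library. Micro-averaging pools true positives, false positives, and false negatives over all labels before computing F1, while macro-averaging computes F1 per label and takes the unweighted mean. This is a sketch on made-up multi-label data with hypothetical helper names, not the tensorflow_addons implementation.

```python
def counts(y_true, y_pred):
    """Return (tp, fp, fn) for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(labels_true, labels_pred):
    """Pool the counts of every label, then compute a single F1."""
    tp = fp = fn = 0
    for y_true, y_pred in zip(labels_true, labels_pred):
        a, b, c = counts(y_true, y_pred)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)


def macro_f1(labels_true, labels_pred):
    """Compute F1 per label, then take the unweighted mean."""
    scores = [f1_from_counts(*counts(y_true, y_pred))
              for y_true, y_pred in zip(labels_true, labels_pred)]
    return sum(scores) / len(scores)


# Two labels over 10 samples: a frequent one predicted perfectly,
# and a rare one that is never predicted.
frequent_true = [1] * 8 + [0] * 2
frequent_pred = [1] * 8 + [0] * 2
rare_true = [1] * 2 + [0] * 8
rare_pred = [0] * 10

micro = micro_f1([frequent_true, rare_true], [frequent_pred, rare_pred])
macro = macro_f1([frequent_true, rare_true], [frequent_pred, rare_pred])
```

Here micro is 16/18 (about 0.89) because the frequent label dominates the pooled counts, while macro is 0.5 because the rare label's F1 of 0 weighs as much as the frequent label's F1 of 1. For a model that collapses on rare attributes, the macro average exposes the problem more clearly.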
Somehow, the predict_generator() of Keras' model does not work as expected. I would rather loop through all test images one-by-one and get the prediction for each image in each iteration. I am using Plaid-ML Keras as my backend and to get prediction I am using the following code.
import os
from PIL import Image
import keras
import numpy
print("Prediction result:")
dir = "/path/to/test/images"
files = os.listdir(dir)
correct = 0
total = 0
# dictionary mapping class indices to labels
classes = {
    0:'This is Cat',
    1:'This is Dog',
}
for file_name in files:
    total += 1
    image = Image.open(dir + "/" + file_name).convert('RGB')
    image = image.resize((100,100))
    image = numpy.expand_dims(image, axis=0)
    image = numpy.array(image)
    image = image/255
    pred = model.predict_classes([image])[0]
    sign = classes[pred]
    # compare in lower case, since the labels are capitalized ('Cat', 'Dog')
    if ("cat" in file_name) and ("cat" in sign.lower()):
        correct += 1
        print(correct,". ", file_name, sign)
    elif ("dog" in file_name) and ("dog" in sign.lower()):
        correct += 1
        print(correct,". ", file_name, sign)
print("accuracy: ", (correct/total))

How to fix "Retval[0] has already been set" when serving saved model

I have a working SavedModel (i.e. a saved model that works when restored in Python) that fails when run on TensorFlow Serving.
The error message on the server is:
OP_REQUIRES failed at function_ops.cc:68 : Internal: Retval[0] has already been set.
The REST API returns 500 and specifies the node on the graph:
[[{{node _retval_loop/concat_0_0}}]]
Exact Steps to Reproduce
Link to the saved model: https://drive.google.com/file/d/1at1CQ9iHgcPHCn-MkvSGcgtbVM2lrKJn/view
It can be restored and run in Python successfully, but it throws an error when run on a model server. It takes an image as input:
sess.run(fetches=["loop/Exit_1:0"],feed_dict={"image_bytes:0": image})
Source code / logs
Relevant source code(I hope):
(contains a while loop with a concat in the body)
val, idx =tf.nn.top_k(softmax ,name="topk")
sentence = tf.Variable([vocab.start_id],False,name="sentence",)
sentence = tf.concat([sentence, idx[0]], 0)#
def cond(sentence,state):
    return tf.math.not_equal(
def body(sentence,state):
    input_seqs = tf.expand_dims([sentence[-1]], 1)
    seq_embeddings = tf.nn.embedding_lookup(self.embedding_map,
    embed = seq_embeddings
    # In inference mode, use concatenated states for convenient feeding and
    # fetching.
    state_feed = tf.concat(axis=1, values=state, name="state")
    # Placeholder for feeding a batch of concatenated states.
    # state_feed = tf.placeholder(dtype=tf.float32,
    # shape=[None,
    # name="state_feed")
    state_tuple = tf.split(value=state_feed, num_or_size_splits=2, axis=1)
    # Run a single LSTM step.
    lstm_outputs, new_state_tuple = lstm_cell(
        inputs=tf.squeeze(embed, axis=[1]),
    # Concatenate the resulting state.
    state = tf.concat(axis=1, values=new_state_tuple, name="state")
    # Stack batches vertically.
    lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])
    with tf.variable_scope("logits") as logits_scope:
        logits = tf.contrib.layers.fully_connected(
            scope=logits_scope, reuse = True
    softmax = tf.nn.softmax(logits, name="softmax")
    self.softmax = softmax
    val, idx = tf.nn.top_k(softmax, name="topk")
    sentence = tf.concat([sentence,idx[0]],0)
    self.output = sentence
    return [sentence, state]
out = tf.while_loop(cond, body, [sentence, state],parallel_iterations=1,maximum_iterations=20,name="loop",shape_invariants=[tf.TensorShape([None]),tf.TensorShape([None,None])])
return out
fails with error:
W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at function_ops.cc:68 : Internal: Retval[0] has already been set.
It could be that the output nodes passed to sess.run include node types such as Enter, Merge, LoopCond, Switch, Exit, Less, etc.

tensorflow error - you must feed a value for placeholder tensor 'in'

I'm trying to implement queues for my TensorFlow prediction but get the following error:
you must feed a value for placeholder tensor 'in' with dtype float and shape [1024,1024,3]
The program works fine if I use feed_dict. I am trying to replace feed_dict with queues.
The program basically takes a list of positions and passes the image NumPy array to the input tensor.
for each in positions:
y,x = each
images = img[y:y+1024,x:x+1024,:]
a = images.astype('float32')
q = tf.FIFOQueue(capacity=200,dtypes=dtypes)
enqueue_op = q.enqueue(a)
qr = tf.train.QueueRunner(q, [enqueue_op] * 1)
data = q.dequeue()
with tf.Session(graph=graph,config=tf.ConfigProto(log_device_placement=True)) as sess:
    p_boxes = graph.get_tensor_by_name("cat:0")
    p_confs = graph.get_tensor_by_name("sha:0")
    y = [p_confs, p_boxes]
    x = graph.get_tensor_by_name("in:0")
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord,sess=sess)
    confs, boxes = sess.run(y)
How can I make sure the input data that I put into the queue is recognized when the graph is run in the session?
In my original run I call:
confs, boxes = sess.run([p_confs, p_boxes], feed_dict=feed_dict_testing)
I'd suggest not using queues for this problem, and switching to the new tf.data API. In particular tf.data.Dataset.from_generator() makes it easier to feed in data from a Python function. You can rewrite your code to be much simpler, as follows:
def generator():
    for y, x in positions:
        images = img[y:y+1024,x:x+1024,:]
        yield images.astype('float32')

# img is indexed as (height, width, channels), so the channel count is img.shape[2]
dataset = tf.data.Dataset.from_generator(
    generator, tf.float32, [1024, 1024, img.shape[2]])
# Add any extra transformations in here, like `dataset.batch()` or
# `dataset.repeat()`.
# ...
iterator = dataset.make_one_shot_iterator()
data = iterator.get_next()
Note that in your program, there's no connection between the data tensor and the graph you loaded in load_graph() (at least, assuming that load_graph() doesn't grab data from the global state!). You will probably need to use tf.import_graph_def() and the input_map argument to associate data with one of the tensors in your frozen graph (possibly "in:0"?) to complete the task.

LookUpError in TensorFlow with tf.cond()

Work environment
TensorFlow release version : 1.3.0-rc2
TensorFlow git version : v1.3.0-rc1-994-gb93fd37
Operating System : CentOS Linux release 7.2.1511 (Core)
Problem Description
I use tf.cond() to switch between the training and validation datasets during processing. The following snippet shows how I have done it:
with tf.variable_scope(tf.get_variable_scope()) as vscope:
    for i in range(4):
        with tf.device('/gpu:%d'%i):
            with tf.name_scope('GPU-Tower-%d'%i) as scope:
                worktype = tf.get_variable("wt",[], initializer=tf.zeros_initializer())
                worktype = tf.assign(worktype, 1)
                workcondition = tf.equal(worktype, 1)
                elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterato\
                net = vgg16cnn2(elem[0],numclasses=256)
                img = elem[0]
                centropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=elem[1],logits= net))
                reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, scope)
                regloss = 0.05 * tf.reduce_sum(reg_losses)
                total_loss = centropy + regloss
                t1 = tf.summary.scalar("Training Batch Loss", total_loss)
                predictions = tf.cast(tf.argmax(tf.nn.softmax(net), 1), tf.int32)
                correct_predictions = tf.cast(tf.equal(predictions, elem[1]), tf.float32)
                batch_accuracy = tf.reduce_mean(correct_predictions)
                t2 = tf.summary.scalar("Training Batch Accuracy", batch_accuracy)
                grads = optim.compute_gradients(total_loss)
So basically, based on the value of worktype, a minibatch will be taken from the training or the validation set.
When I run this code, I get the following LookupError:
LookupError: No gradient defined for operation 'GPU-Tower-0/cond/IteratorGetNext_1' (op type: IteratorGetNext)
Why does TensorFlow think that IteratorGetNext_1 requires a gradient? How can I remedy this?
The variable worktype is marked as trainable. By default, Optimizer.compute_gradients(...) computes the gradients for all trainable variables.
There are two ways you could solve this:
Set trainable=False in tf.get_variable(...).
Explicitly specify the variables for which the gradients should be computed with the var_list argument of Optimizer.compute_gradients(...).