ERROR: engine.cpp (370) - Cuda Error in ~ExecutionContext: 77 - tensorflow

I am doing Int8 calibration using TensorRT.
Once calibration completes, I test inference and get an error at stream.synchronize() in the following function.
The FP32 and FP16 engines run without issue; only the Int8 engine fails. What could be wrong?
def infer(engine, x, batch_size, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    #img = np.array(x).ravel()
    im = np.array(x, dtype=np.float32, order='C')
    im = im[:,:,::-1]
    #im = im.transpose((2,0,1))
    #np.copyto(inputs[0].host, x.flatten()) #1.0 - img / 255.0
    np.copyto(inputs[0].host, im.flatten())
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

The following code runs without error. The only difference is that the buffers are sized with engine.max_batch_size instead of batch_size.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
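Since the two versions differ only in the factor used to size the buffers, one quick check is to compare the element counts each would allocate. The sketch below is plain Python with no TensorRT calls; `buffer_elements` is a hypothetical helper, and the binding shape and batch sizes are made-up values for illustration. It does not establish the cause of the error, only whether the two sizings disagree.

```python
from math import prod


def buffer_elements(binding_shape, batch):
    """Element count for one binding, mirroring trt.volume(shape) * batch."""
    return prod(binding_shape) * batch


# Hypothetical values: a 3x224x224 input binding (not taken from the post).
shape = (3, 224, 224)
per_call = buffer_elements(shape, batch=1)    # sized with batch_size
per_engine = buffer_elements(shape, batch=8)  # sized with engine.max_batch_size

print(per_call, per_engine)
if per_call < per_engine:
    print("batch_size-sized buffers are smaller than max_batch_size-sized ones")
```

If the two numbers differ for the real engine, printing `engine.max_batch_size` next to the `batch_size` passed to `infer` is a cheap next step.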


How to access a SPECIFIC label in Tensorflow Lite object?

I have the code below and I don't know how to access the "category_name" attribute. If it detects a person, I want it to print "Hello" in the command prompt.
I tried a lot of different syntaxes and none of them worked. Below is an image of how the "list" object looks when I inspect it.
What I want is the "category_name". You can see in the code that I tried an "if" that didn't help much: since it's detecting up to 3 objects simultaneously, I guess the array has 3 elements, which themselves have multiple elements.
Is there a beginner-friendly answer to this?
Note: I am using a Raspberry Pi 4 B.
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.
"""Main script to run the object detection routine."""
import argparse
import sys
import time
import cv2
from tflite_support.task import core
from tflite_support.task import processor
from tflite_support.task import vision
import utils
def run(model: str, camera_id: int, width: int, height: int, num_threads: int,
        enable_edgetpu: bool) -> None:
  """Continuously run inference on images acquired from the camera.

  Args:
    model: Name of the TFLite object detection model.
    camera_id: The camera id to be passed to OpenCV.
    width: The width of the frame captured from the camera.
    height: The height of the frame captured from the camera.
    num_threads: The number of CPU threads to run the model.
    enable_edgetpu: True/False whether the model is a EdgeTPU model.
  """
  # Variables to calculate FPS
  counter, fps = 0, 0
  start_time = time.time()
  # Start capturing video input from the camera
  cap = cv2.VideoCapture(camera_id)
  cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
  cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
  # Visualization parameters
  row_size = 20  # pixels
  left_margin = 24  # pixels
  text_color = (0, 0, 255)  # red
  font_size = 1
  font_thickness = 1
  fps_avg_frame_count = 10
  # Initialize the object detection model
  base_options = core.BaseOptions(
      file_name=model, use_coral=enable_edgetpu, num_threads=num_threads)
  detection_options = processor.DetectionOptions(
      max_results=3, score_threshold=0.3)
  options = vision.ObjectDetectorOptions(
      base_options=base_options, detection_options=detection_options)
  detector = vision.ObjectDetector.create_from_options(options)
  # Continuously capture images from the camera and run inference
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      sys.exit(
          'ERROR: Unable to read from webcam. Please verify your webcam settings.'
      )
    counter += 1
    image = cv2.flip(image, 1)
    # Convert the image from BGR to RGB as required by the TFLite model.
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Create a TensorImage object from the RGB image.
    input_tensor = vision.TensorImage.create_from_array(rgb_image)
    # Run object detection estimation using the model.
    detection_result = detector.detect(input_tensor)
    #if detection_result[0].detections.categories.category_name)=='person':
    #if getattr(detection_result, 'label') =='person':
    #  print("YES")
    # Draw keypoints and edges on input image
    image = utils.visualize(image, detection_result)
    # Calculate the FPS
    if counter % fps_avg_frame_count == 0:
      end_time = time.time()
      fps = fps_avg_frame_count / (end_time - start_time)
      start_time = time.time()
    # Show the FPS
    fps_text = 'FPS = {:.1f}'.format(fps)
    text_location = (left_margin, row_size)
    cv2.putText(image, fps_text, text_location, cv2.FONT_HERSHEY_PLAIN,
                font_size, text_color, font_thickness)
    # Stop the program if the ESC key is pressed.
    if cv2.waitKey(1) == 27:
      break
    cv2.imshow('object_detector', image)
  # Release the camera and close the preview window on exit.
  cap.release()
  cv2.destroyAllWindows()
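def main():
  parser = argparse.ArgumentParser()
  # Flag names follow the args.* uses below. Apart from --cameraId, the
  # defaults were cut from the post; the values here are assumed from the
  # upstream TensorFlow Lite Raspberry Pi example.
  parser.add_argument(
      '--model',
      help='Path of the object detection model.',
      required=False,
      default='efficientdet_lite0.tflite')
  parser.add_argument(
      '--cameraId', help='Id of camera.', required=False, type=int, default=0)
  parser.add_argument(
      '--frameWidth',
      help='Width of frame to capture from camera.',
      required=False,
      type=int,
      default=640)
  parser.add_argument(
      '--frameHeight',
      help='Height of frame to capture from camera.',
      required=False,
      type=int,
      default=480)
  parser.add_argument(
      '--numThreads',
      help='Number of CPU threads to run the model.',
      required=False,
      type=int,
      default=4)
  parser.add_argument(
      '--enableEdgeTPU',
      help='Whether to run the model on EdgeTPU.',
      action='store_true',
      required=False,
      default=False)
  args = parser.parse_args()
  run(args.model, int(args.cameraId), args.frameWidth, args.frameHeight,
      int(args.numThreads), bool(args.enableEdgeTPU))

if __name__ == '__main__':
  main()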
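A minimal sketch of reading the label, assuming the result has the nested shape the question describes: a result object holding a list of detections, each holding a list of categories with a `category_name` field. The dataclasses below are stand-ins so the snippet runs without a camera or model; they are not the real `tflite_support` classes, and `detected_labels` and `greet_if_person` are hypothetical helpers. Field names may differ between library versions, so it is worth printing one detection first to confirm them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:            # stand-in for one (label, score) pair
    category_name: str
    score: float


@dataclass
class Detection:           # stand-in for one detected object
    categories: List[Category] = field(default_factory=list)


@dataclass
class DetectionResult:     # stand-in for the detector's return value
    detections: List[Detection] = field(default_factory=list)


def detected_labels(result):
    """Collect every category name from every detection."""
    return [c.category_name for d in result.detections for c in d.categories]


def greet_if_person(result):
    """Print "Hello" once if any detection is labelled 'person'."""
    found = 'person' in detected_labels(result)
    if found:
        print('Hello')
    return found


result = DetectionResult(detections=[
    Detection([Category('person', 0.81)]),
    Detection([Category('chair', 0.55)]),
    Detection([Category('cup', 0.34)]),
])
greet_if_person(result)
```

In the loop from the question, the same two-level iteration would go right after the `detector.detect(input_tensor)` call, using `detection_result` in place of `result`.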

Modify and combine two different frozen graphs generated using tensorflow object detection API for inference

I am working with the TensorFlow Object Detection API and have trained two different models (SSD-MobileNet and FRCNN-Inception-v2) for my use case. Currently, my workflow is:
1. Take an input image and detect one particular object using SSD.
2. Crop the input image with the bounding box generated in step 1, then resize it to a fixed size (e.g. 200 x 300).
3. Feed this cropped and resized image to FRCNN-Inception-v2 to detect smaller objects inside the ROI.
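Step 2 can be sketched in plain Python to make the box arithmetic concrete. This assumes the detector returns boxes as normalized [ymin, xmin, ymax, xmax], which is the convention `tf.image.crop_and_resize` expects; the helper names, the nested-list image, and the nearest-neighbour resize are illustrative stand-ins, not the graph ops used in deployment.

```python
def box_to_pixels(box, height, width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel bounds."""
    ymin, xmin, ymax, xmax = box
    return (int(ymin * height), int(xmin * width),
            int(ymax * height), int(xmax * width))


def crop(image, box):
    """Crop a row-major nested-list image to the given normalized box."""
    height, width = len(image), len(image[0])
    top, left, bottom, right = box_to_pixels(box, height, width)
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize to a fixed (out_h, out_w) size."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


# A 4x4 toy image whose pixel value encodes its (row, col) position.
image = [[10 * r + c for c in range(4)] for r in range(4)]
roi = crop(image, [0.5, 0.5, 1.0, 1.0])   # bottom-right quadrant
fixed = resize_nearest(roi, 4, 4)         # resized to a fixed 4x4 size
print(roi)
print(fixed)
```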
At inference time, when I load two separate frozen graphs and follow these steps, I get my desired results. But I need a single frozen graph because of my deployment requirements. I am new to TensorFlow and want to combine both graphs with the crop-and-resize step in between them.
Thanks, @matt and @Vedanshu, for responding. Here is the updated code that works for my requirement. Please suggest improvements if it needs any, as I am still learning.
# Dependencies
import tensorflow as tf
import numpy as np

# load graphs using pb file path
def load_graph(pb_file):
    graph = tf.Graph()
    with graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_file, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')
    return graph

# returns tensor dictionaries from graph
def get_inference(graph, count=0):
    with graph.as_default():
        ops = tf.get_default_graph().get_operations()
        all_tensor_names = {output.name for op in ops for output in op.outputs}
        tensor_dict = {}
        for key in ['num_detections', 'detection_boxes', 'detection_scores',
                    'detection_classes', 'detection_masks', 'image_tensor']:
            # For count > 0 the tensor is named "<key>_<count>:0".
            tensor_name = key + ':0' if count == 0 else key + '_{}:0'.format(count)
            if tensor_name in all_tensor_names:
                tensor_dict[key] = tf.get_default_graph().\
                    get_tensor_by_name(tensor_name)
        return tensor_dict

# renames while_context because there is one while function for every graph
# open issue at
def rename_frame_name(graphdef, suffix):
    for n in graphdef.node:
        if "while" in n.name:
            if "frame_name" in n.attr:
                n.attr["frame_name"].s = str(n.attr["frame_name"]).replace("while_context",
                                                                           "while_context" + suffix).encode('utf-8')
if __name__ == '__main__':
    # your pb file paths
    frozenGraphPath1 = '...replace_with_your_path/some_frozen_graph.pb'
    frozenGraphPath2 = '...replace_with_your_path/some_frozen_graph.pb'
    # new file name to save combined model
    combinedFrozenGraph = 'combined_frozen_inference_graph.pb'
    # loads both graphs
    graph1 = load_graph(frozenGraphPath1)
    graph2 = load_graph(frozenGraphPath2)
    # get tensor names from first graph
    tensor_dict1 = get_inference(graph1)
    with graph1.as_default():
        # getting tensors to add crop and resize step
        image_tensor = tensor_dict1['image_tensor']
        scores = tensor_dict1['detection_scores'][0]
        num_detections = tf.cast(tensor_dict1['num_detections'][0], tf.int32)
        detection_boxes = tensor_dict1['detection_boxes'][0]
        # I had to add NMS because my SSD model outputs 100 detections, and it
        # runs out of memory because of the huge tensor shape
        selected_indices = tf.image.non_max_suppression(detection_boxes, scores, 5, iou_threshold=0.5)
        selected_boxes = tf.gather(detection_boxes, selected_indices)
        # intermediate crop and resize step, which will be input for second model (FRCNN)
        cropped_img = tf.image.crop_and_resize(image_tensor,
                                               selected_boxes,
                                               tf.zeros(tf.shape(selected_indices), dtype=tf.int32),
                                               [300, 60])  # resize to 300 X 60
        cropped_img = tf.cast(cropped_img, tf.uint8, name='cropped_img')
    gdef1 = graph1.as_graph_def()
    gdef2 = graph2.as_graph_def()
    g1name = "graph1"
    g2name = "graph2"
    # renaming while_context in both graphs
    rename_frame_name(gdef1, g1name)
    rename_frame_name(gdef2, g2name)
    # This combines both models and saves the result as one graph
    with tf.Graph().as_default() as g_combined:
        x, y = tf.import_graph_def(gdef1, return_elements=['image_tensor:0', 'cropped_img:0'])
        z, = tf.import_graph_def(gdef2, input_map={"image_tensor:0": y}, return_elements=['detection_boxes:0'])
        tf.train.write_graph(g_combined, "./", combinedFrozenGraph, as_text=False)
You can feed the output of one graph into another using input_map in import_graph_def. You also have to rename the while_context, because there is one while function for every graph. Something like this:
def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.GFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def

def rename_frame_name(graphdef, suffix):
    # Bug reported at
    for n in graphdef.node:
        if "while" in n.name:
            if "frame_name" in n.attr:
                n.attr["frame_name"].s = str(n.attr["frame_name"]).replace("while_context",
                                                                           "while_context" + suffix).encode('utf-8')
l1_graph = tf.Graph()
with l1_graph.as_default():
    trt_graph1 = get_frozen_graph(pb_fname1)
    [tf_input1, tf_scores1, tf_boxes1, tf_classes1, tf_num_detections1] = tf.import_graph_def(trt_graph1,
        return_elements=['image_tensor:0', 'detection_scores:0', 'detection_boxes:0', 'detection_classes:0', 'num_detections:0'])
    input1 = tf.identity(tf_input1, name="l1_input")
    boxes1 = tf.identity(tf_boxes1[0], name="l1_boxes")  # index by 0 to remove batch dimension
    scores1 = tf.identity(tf_scores1[0], name="l1_scores")
    classes1 = tf.identity(tf_classes1[0], name="l1_classes")
    num_detections1 = tf.identity(tf.dtypes.cast(tf_num_detections1[0], tf.int32), name="l1_num_detections")
    # Make your output tensor (here, crop the input image with the bounding box
    # generated from step 1 and then resize it to a fixed size, e.g. 200 X 300).
    tf_out = ...  # placeholder: replace with your output tensor

connected_graph = tf.Graph()
with connected_graph.as_default():
    l1_graph_def = l1_graph.as_graph_def()
    g1name = 'ved'
    rename_frame_name(l1_graph_def, g1name)
    tf.import_graph_def(l1_graph_def, name=g1name)
    trt_graph2 = get_frozen_graph(pb_fname2)
    g2name = 'level2'
    rename_frame_name(trt_graph2, g2name)
    [tf_scores, tf_boxes, tf_classes, tf_num_detections] = tf.import_graph_def(trt_graph2,
        input_map={'image_tensor': tf_out},
        return_elements=['detection_scores:0', 'detection_boxes:0', 'detection_classes:0', 'num_detections:0'])
# Export the graph
with connected_graph.as_default():
    cwd = os.getcwd()
    path = os.path.join(cwd, 'saved_model')
    shutil.rmtree(path, ignore_errors=True)
    inputs_dict = {
        "image_tensor": tf_input
    }
    outputs_dict = {
        "detection_boxes_l1": tf_boxes_l1,
        "detection_scores_l1": tf_scores_l1,
        "detection_classes_l1": tf_classes_l1,
        "max_num_detection": tf_max_num_detection,
        "detection_boxes_l2": tf_boxes_l2,
        "detection_scores_l2": tf_scores_l2,
        "detection_classes_l2": tf_classes_l2
    }
    # The call name was lost from the post. tf.saved_model.simple_save is
    # assumed here because its (session, export_dir, inputs, outputs)
    # signature matches the surviving arguments.
    tf.saved_model.simple_save(
        tf_sess_main, path, inputs_dict, outputs_dict
    )
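The effect of the renaming can be seen without TensorFlow. In this sketch, nodes are plain dictionaries standing in for GraphDef nodes, and the node and frame names are made up for illustration. It only shows why a per-graph suffix keeps the two loops' frame names from colliding once both graphs are imported into one.

```python
def rename_frames(nodes, suffix):
    """Append `suffix` to the frame name of every while-loop node, in place."""
    for node in nodes:
        if "while" in node["name"] and "frame_name" in node:
            node["frame_name"] = node["frame_name"].replace(
                "while_context", "while_context" + suffix)


def frame_names(nodes):
    """Return the set of frame names used by a list of nodes."""
    return {n["frame_name"] for n in nodes if "frame_name" in n}


# Two toy "graphs", each with a loop that uses the same default frame name.
graph1 = [{"name": "Preprocessor/while/Enter",
           "frame_name": "Preprocessor/while/while_context"},
          {"name": "image_tensor"}]
graph2 = [{"name": "Preprocessor/while/Enter",
           "frame_name": "Preprocessor/while/while_context"},
          {"name": "detection_boxes"}]

clash_before = frame_names(graph1) & frame_names(graph2)
rename_frames(graph1, "graph1")
rename_frames(graph2, "graph2")
clash_after = frame_names(graph1) & frame_names(graph2)
print(clash_before, clash_after)
```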

How to fix "Retval[0] has already been set" when serving saved model

I have a working SavedModel (i.e. a saved model that works when restored in Python) that fails when run on TensorFlow Serving.
The error message on the server is:
OP_REQUIRES failed at : Internal: Retval[0] has already been set.
The REST API returns 500 and specifies the node on the graph:
[[{{node _retval_loop/concat_0_0}}]]
Exact Steps to Reproduce
(Link to the saved model.) It can be restored and run in Python successfully, but it throws an error when run on a model server. It takes an image as input, fetching ["loop/Exit_1:0"] with feed_dict={"image_bytes:0": image}.
Source code / logs
Relevant source code (I hope):
(contains a while loop with a concat in the body)
val, idx = tf.nn.top_k(softmax, name="topk")
sentence = tf.Variable([vocab.start_id], False, name="sentence")
sentence = tf.concat([sentence, idx[0]], 0)

def cond(sentence, state):
    return tf.math.not_equal(...)  # arguments missing from the post

def body(sentence, state):
    input_seqs = tf.expand_dims([sentence[-1]], 1)
    seq_embeddings = tf.nn.embedding_lookup(self.embedding_map, input_seqs)
    embed = seq_embeddings
    # In inference mode, use concatenated states for convenient feeding and
    # fetching.
    state_feed = tf.concat(axis=1, values=state, name="state")
    # Placeholder for feeding a batch of concatenated states.
    # state_feed = tf.placeholder(dtype=tf.float32,
    #                             shape=[None,
    #                             name="state_feed")
    state_tuple = tf.split(value=state_feed, num_or_size_splits=2, axis=1)
    # Run a single LSTM step.
    lstm_outputs, new_state_tuple = lstm_cell(
        inputs=tf.squeeze(embed, axis=[1]),
        state=state_tuple)
    # Concatenate the resulting state.
    state = tf.concat(axis=1, values=new_state_tuple, name="state")
    # Stack batches vertically.
    lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])
    with tf.variable_scope("logits") as logits_scope:
        logits = tf.contrib.layers.fully_connected(
            ...,  # remaining arguments missing from the post
            scope=logits_scope, reuse=True)
    softmax = tf.nn.softmax(logits, name="softmax")
    self.softmax = softmax
    val, idx = tf.nn.top_k(softmax, name="topk")
    sentence = tf.concat([sentence, idx[0]], 0)
    self.output = sentence
    return [sentence, state]

out = tf.while_loop(cond, body, [sentence, state], parallel_iterations=1,
                    maximum_iterations=20, name="loop",
                    shape_invariants=[tf.TensorShape([None]), tf.TensorShape([None, None])])
return out
It fails with the error:
W external/org_tensorflow/tensorflow/core/framework/] OP_REQUIRES failed at : Internal: Retval[0] has already been set.
It could be that the output nodes include node types such as Enter, Merge, LoopCond, Switch, Exit, Less, etc.

tensorflow error - you must feed a value for placeholder tensor 'in'

I'm trying to implement queues for my TensorFlow prediction but get the following error:
you must feed a value for placeholder tensor 'in' with dtype float and shape [1024,1024,3]
The program works fine if I use feed_dict; I'm trying to replace feed_dict with queues.
The program basically takes a list of positions and passes the image NumPy array to the input tensor.
for each in positions:
y,x = each
images = img[y:y+1024,x:x+1024,:]
a = images.astype('float32')
q = tf.FIFOQueue(capacity=200,dtypes=dtypes)
enqueue_op = q.enqueue(a)
qr = tf.train.QueueRunner(q, [enqueue_op] * 1)
data = q.dequeue()
with tf.Session(graph=graph,config=tf.ConfigProto(log_device_placement=True)) as sess:
p_boxes = graph.get_tensor_by_name("cat:0")
p_confs = graph.get_tensor_by_name("sha:0")
y = [p_confs, p_boxes]
x = graph.get_tensor_by_name("in:0")
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord,sess=sess)
confs, boxes = sess.run(y)
How can I make sure the input data that I put into the queue is recognized when running the graph in the session?
In my original run I call:
confs, boxes = sess.run([p_confs, p_boxes], feed_dict=feed_dict_testing)
I'd suggest not using queues for this problem, and switching to the new tf.data API. In particular, tf.data.Dataset.from_generator() makes it easier to feed in data from a Python function. You can rewrite your code to be much simpler, as follows:
def generator():
    for y, x in positions:
        images = img[y:y+1024, x:x+1024, :]
        yield images.astype('float32')

# img is height x width x channels, so the channel count is img.shape[2].
dataset = tf.data.Dataset.from_generator(
    generator, tf.float32, [1024, 1024, img.shape[2]])
# Add any extra transformations in here, like `dataset.batch()` or
# `dataset.repeat()`.
# ...
iterator = dataset.make_one_shot_iterator()
data = iterator.get_next()
Note that in your program, there's no connection between the data tensor and the graph you loaded in load_graph() (at least, assuming that load_graph() doesn't grab data from the global state!). You will probably need to use tf.import_graph_def() and the input_map argument to associate data with one of the tensors in your frozen graph (possibly "in:0"?) to complete the task.

Dequeueing from RandomShuffleQueue does not reduce size

In order to train a model I have encapsulated it in a class.
I use a tf.RandomShuffleQueue to enqueue a list of filenames.
However, when I dequeue elements, they are dequeued but the size of the queue does not decrease.
My specific questions, followed by the code snippet:
1. If I have only 5 images, for example, but the steps range up to 100, would this result in addfilenames being called repeatedly and automatically? It does not give me any error on dequeuing, so I think it is being called automatically.
2. Why is the size of the tf.RandomShuffleQueue not changing? It remains constant.
import os
import time
import functools
import tensorflow as tf
from Read_labelclsloc import readlabel

def ReadTrain(traindir):
    # Returns a list of training images, their labels and a dictionary.
    # The dictionary maps label names to integer numbers.
    return trainimgs, trainlbls, classdict

def ReadVal(valdir, classdict):
    # Reads the validation image labels.
    # Returns a dictionary with filenames as keys and
    # corresponding labels as values.
    return valdict

def lazy_property(function):
    # Just a decorator to make sure that on repeated calls to
    # member functions, ops don't get created repeatedly.
    # Acknowledgements :
    attribute = '_cache_' + function.__name__
    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return decorator
class ModelInitial:
    def __init__(self, traindir, valdir):
        self.traindir = traindir
        self.valdir = valdir
        self.epoch = 0

    def traininginfo(self):
        self.trainimgs, self.trainlbls, self.classdict = ReadTrain(self.traindir)
        self.valdict = ReadVal(self.valdir, self.classdict)
        with self.graph.as_default():
            self.trainimgs_tensor = tf.constant(self.trainimgs)
            self.trainlbls_tensor = tf.constant(self.trainlbls, dtype=tf.uint16)
            self.trainimgs_dict = {}
            self.trainimgs_dict["ImageFile"] = self.trainimgs_tensor
        return None

    # graph and addfilenames are read as attributes below, so they are
    # wrapped with lazy_property.
    @lazy_property
    def graph(self):
        g = tf.Graph()
        with g.as_default():
            # Layer definitions go here
            pass
        return g

    @lazy_property
    def addfilenames(self):
        # This is the function where filenames are pushed to a RandomShuffleQueue
        filename_queue = tf.RandomShuffleQueue(capacity=len(self.trainimgs), min_after_dequeue=0,
                                               dtypes=[tf.string], names=["ImageFile"],
                                               seed=0, name="filename_queue")
        sz_op = filename_queue.size()
        dq_op = filename_queue.dequeue()
        enq_op = filename_queue.enqueue_many(self.trainimgs_dict)
        return filename_queue, enq_op, sz_op, dq_op
    def Train(self):
        # The function for training.
        # I have not written the training part yet.
        # Still struggling with preprocessing
        with self.graph.as_default():
            filename_q, filename_enqueue_op, sz_op, dq_op = self.addfilenames
            qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
            filename_dequeue_op = filename_q.dequeue()
            init_op = tf.global_variables_initializer()
        sess = tf.Session(graph=self.graph)
        coord = tf.train.Coordinator()
        enq_threads = qr.create_threads(sess, coord=coord, start=True)
        counter = 0
        for step in range(100):
            print("Epoch = %d " % (self.epoch))
            print("size = %d" % (sess.run(sz_op)))
            names = [n.name for n in self.graph.as_graph_def().node]
            print("Counter = %d" % (counter))
        return None

if __name__ == "__main__":
    modeltrain = ModelInitial("<Path to training images>",
                              "<Path to validation images>")
    a = modeltrain.graph
The mystery is caused by the tf.train.QueueRunner that you created for the queue, which causes it to be filled in the background.
1. The following lines cause a background "queue runner" thread to be created:
qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
# ...
enq_threads = qr.create_threads(sess, coord=coord, start=True)
This thread calls filename_enqueue_op in a loop, which causes the queue to be filled up as you remove elements from it.
2. The background thread from step 1 will almost always have a pending enqueue operation (filename_enqueue_op) on the queue. This means that after you dequeue a filename, the pending enqueue will run and fill the queue back up to capacity. (Technically there is a race condition here and you could see a size of capacity - 1, but this is quite unlikely.)
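The same behaviour can be reproduced with the Python standard library alone. This is only an analogy, not TensorFlow code: `queue.Queue` stands in for the RandomShuffleQueue, a plain thread stands in for the queue runner, and the function names are hypothetical. With the refill thread running, the observed size stays at capacity; without it, the size drops on every dequeue.

```python
import queue
import threading
import time


def sizes_with_runner(capacity=5, steps=20):
    """Dequeue `steps` items while a background thread keeps refilling."""
    q = queue.Queue(maxsize=capacity)
    stop = threading.Event()

    def refill():
        # Like a queue runner: enqueue in a loop, blocking while the queue is full.
        i = 0
        while not stop.is_set():
            try:
                q.put(i, timeout=0.05)
                i += 1
            except queue.Full:
                continue

    runner = threading.Thread(target=refill, daemon=True)
    runner.start()
    sizes = []
    for _ in range(steps):
        q.get()
        # Give the pending enqueue a moment to refill the freed slot.
        deadline = time.time() + 1.0
        while not q.full() and time.time() < deadline:
            time.sleep(0.001)
        sizes.append(q.qsize())
    stop.set()
    runner.join()
    return sizes


def sizes_without_runner(capacity=5):
    """Fill the queue once, then dequeue with nothing refilling it."""
    q = queue.Queue(maxsize=capacity)
    for i in range(capacity):
        q.put(i)
    sizes = []
    for _ in range(capacity):
        q.get()
        sizes.append(q.qsize())
    return sizes


print(sizes_with_runner())
print(sizes_without_runner())
```

This also covers the first question: with 5 items and more than 5 steps, dequeuing never blocks because the background thread keeps re-enqueuing.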