Memory leak caused by incomplete executeAsync() function execution - tensorflow

As the title suggests, I was running inference with executeAsync() for an object detection task. The function works fine as long as there is a detection in every frame of my live video, and there is no memory leak. However, if there is no detection in a particular frame, an error is thrown while executeAsync() runs:
TypeError: Cannot read property '0' of undefined
And here's the memory growth that happens when there's no detection:
{unreliable: false, numBytesInGPU: 422861990, numTensors: 1111, numDataBuffers: 673, numBytes: 320663096}
{unreliable: false, numBytesInGPU: 490830799, numTensors: 1263, numDataBuffers: 752, numBytes: 371606568}
{unreliable: false, numBytesInGPU: 558799608, numTensors: 1415, numDataBuffers: 831, numBytes: 422550040}
I think the issue lies in the dangling tensors created inside executeAsync(): when there is a detection I dispose of them, but when there is no detection I can't get a reference to them, because the test_result variable is never assigned due to the error above.
I wanted to use tf.tidy(), but from what I've read it can't be used with asynchronous functions.
How can I get rid of the memory leak? Or is there a problem with my model inference? I'd really appreciate any help.
async detect(video) {
  console.log("detecting...");
  const example = tf.browser.fromPixels(video);
  const tf4d = example.expandDims(0);
  try {
    // throws if there's no detection
    const test_result = await this.model.executeAsync(
      { image_tensor: tf4d },
      ['detection_boxes', 'num_detections', 'detection_classes', 'detection_scores']) as tf.Tensor[];
    this.no_detection = false;
    const detection_boxes = test_result[0].dataSync();
    const num_detections = test_result[1].dataSync();
    const detection_classes = test_result[2].dataSync();
    const detection_scores = test_result[3].dataSync();
    // dispose the output tensors to avoid a memory leak
    tf.dispose(test_result);
    tf4d.dispose();
    example.dispose();
    console.log(tf.memory());
  } catch (error) {
    console.log("No detection");
    console.log(error);
    this.no_detection = true;
    // dispose the tensors created from the frame if there's no detection
    tf4d.dispose();
    example.dispose();
    console.log(tf.memory());
  }
}
Edit: I think it's also worth mentioning that I retrained my own model using the TensorFlow Object Detection API, following this tutorial.
I think there might also be some problem with exporting the inference graph using export_inference_graph.py, since I also got this
23 ops no flops stats due to incomplete shapes.
while converting.
I tried converting a pretrained inference graph to a tfjs model and it works fine: if there's no detection, executeAsync() just returns an empty array.
That's why I think there may be a problem with how the inference graph of my own model was exported.
Here's the command (I tried both with and without the --input_shape argument):
python export_inference_graph.py --input_type image_tensor --input_shape 1,300,300,3 --pipeline_config_path training/ssd_mobilenet_v2.config --trained_checkpoint_prefix training/model.ckpt-38153 --output_directory trained-inference-graphs/output_inference_graph

Related

TensorFlow: how to fix createTFLiteSIMDModule of tflite returning empty buffers

I don't understand where the problem is coming from, but when I call createTFLiteSIMDModule from the tflite-simd file it returns empty buffers, although it was working as expected before, and tflite._getModelBufferMemoryOffset() returns 0. What am I missing? Is there any declaration I need to make beforehand?
import createTFLiteSIMDModule from './tflite/tflite-simd.js';
const tflite = await createTFLiteSIMDModule();
const modelBufferOffset = tflite._getModelBufferMemoryOffset();
this is the result of console.log(tflite)
Providing the solution here for the benefit of the community.
The issue was resolved by updating the files; see the ref link.

Calculator node for pose-based action recognition

I want to add an action_recognition calculator node to the pose_landmark detector (pose_landmark_gpu.pbtxt). Does anyone know if there is already a calculator implementation suited for that purpose?
i.e.
Input: pose landmarks
Inference via tflite model
Output: probability values for the respective action classes
I've seen that the original pose landmark detector uses tensors_to_landmarks_calculator.cc. I would need a similar file but for different input & output types. Any idea if there is a "template" cc file that I could adapt to my use case?
Just for better understanding, here is my edited pbtxt of the pose_landmark detector with an additional node for action classification:
# GPU buffer. (GpuBuffer)
input_stream: "input_video"
output_stream: "output_video" # Output image with rendered results. (GpuBuffer)
output_stream: "pose_landmarks" # Pose landmarks. (NormalizedLandmarkList)
output_stream: "action_detection" # Action Probabilities
node {
  calculator: "FlowLimiterCalculator"
  input_stream: "input_video"
  input_stream: "FINISHED:output_video"
  input_stream_info: {
    tag_index: "FINISHED"
    back_edge: true
  }
  output_stream: "throttled_input_video"
}

# Subgraph that detects poses and corresponding landmarks.
node {
  calculator: "PoseLandmarkGpu"
  input_stream: "IMAGE:throttled_input_video"
  output_stream: "LANDMARKS:pose_landmarks"
  output_stream: "DETECTION:pose_detection"
  output_stream: "ROI_FROM_LANDMARKS:roi_from_landmarks"
}

# Subgraph that renders pose-landmark annotation onto the input image.
node {
  calculator: "PoseRendererGpu"
  input_stream: "IMAGE:throttled_input_video"
  input_stream: "LANDMARKS:pose_landmarks"
  input_stream: "ROI:roi_from_landmarks"
  input_stream: "DETECTION:pose_detection"
  output_stream: "IMAGE:output_video"
}

# Subgraph that detects actions from poses.
node {
  calculator: "ActionDetectorGPU"
  input_stream: "LANDMARKS:pose_landmarks"
  output_stream: "ACTION:action_detection"
}
Update
There is an open-source project called SigNN that does the same thing as I'm intending, just for hand pose classification (into American Sign Language letters). I'm going to plow through that...
Here is a more general formulation of a similar problem. There is a solution using MediaPipeUnityPlugin (but the same graph would also work in pure mediapipe, though there is no released driver code at the time of writing this)
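For orientation, a skeleton for such a calculator could look roughly like the sketch below. This is only a sketch modeled on existing MediaPipe calculators: the class name and stream tags mirror the pbtxt above, the output type (a plain std::vector<float> of per-class probabilities) is an assumption, the actual tflite inference is left out, and the exact Status/registration details vary between MediaPipe versions.
#include <vector>

#include "absl/memory/memory.h"
#include "mediapipe/framework/calculator_framework.h"
#include "mediapipe/framework/formats/landmark.pb.h"

namespace mediapipe {

// Consumes pose landmarks and emits per-class action probabilities.
class ActionDetectorGPU : public CalculatorBase {
 public:
  static absl::Status GetContract(CalculatorContract* cc) {
    cc->Inputs().Tag("LANDMARKS").Set<NormalizedLandmarkList>();
    cc->Outputs().Tag("ACTION").Set<std::vector<float>>();
    return absl::OkStatus();
  }

  absl::Status Process(CalculatorContext* cc) override {
    if (cc->Inputs().Tag("LANDMARKS").IsEmpty()) {
      return absl::OkStatus();
    }
    const NormalizedLandmarkList& landmarks =
        cc->Inputs().Tag("LANDMARKS").Get<NormalizedLandmarkList>();

    // Flatten the landmarks into a feature vector; this is what would be
    // fed to the tflite model.
    std::vector<float> features;
    features.reserve(landmarks.landmark_size() * 3);
    for (const auto& lm : landmarks.landmark()) {
      features.push_back(lm.x());
      features.push_back(lm.y());
      features.push_back(lm.z());
    }

    // Placeholder: real code would run the model here and fill `probs`
    // with one probability per action class.
    auto probs = absl::make_unique<std::vector<float>>();
    cc->Outputs().Tag("ACTION").Add(probs.release(), cc->InputTimestamp());
    return absl::OkStatus();
  }
};
REGISTER_CALCULATOR(ActionDetectorGPU);

}  // namespace mediapipe
It would still need its own BUILD target, and the model itself could be run either directly through the TFLite C++ API inside Process() or by chaining MediaPipe's existing inference calculators in front of a small pre/post-processing calculator like this one.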

In tensorflow, when calling tf.memory(), what does unreliable: true mean?

My call to tf.memory() results in:
{
  unreliable: true,
  numTensors: 125289,
  numDataBuffers: 125289,
  numBytes: 12289704
}
It does not provide reasons for being unreliable, as the documentation suggests it should, and I cannot find information about what it means.
What does it mean for the memory to be unreliable in tensorflow?

Tensorflow new Op CUDA kernel memory management

I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel.
This Op requires a lot of dynamic memory allocation for variables which are not tensors and are deallocated after the Op is done; more specifically, it involves using a hash table.
Right now I am using cudaMalloc() and cudaFree(), but I have noticed that TensorFlow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU (roughly as in the sketch after my questions below).
My questions:
Is it best practice to use Eigen::GPUDevice to manage GPU memory?
By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using C++ syntax (i.e. auto var = new int[100]; delete[] var)?
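Purely to illustrate the Eigen::GPUDevice route the question refers to, a sketch might look like the snippet below (the scratch size is made up, and whether this is good practice is exactly what the answer addresses):
// Sketch: raw scratch allocation through the per-op Eigen GPU device,
// instead of calling cudaMalloc()/cudaFree() directly.
void Compute(tensorflow::OpKernelContext* ctx) override {
  const Eigen::GpuDevice& d = ctx->eigen_gpu_device();

  const size_t num_bytes = 1024 * sizeof(int);  // made-up hash-table scratch size
  void* scratch = d.allocate(num_bytes);        // instead of cudaMalloc()

  // ... launch CUDA kernels on d.stream() that use `scratch` ...

  d.deallocate(scratch);                        // instead of cudaFree()
}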
There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory:
template<typename Device, typename Dtype>
class MyOp : public OpKernel {
 public:
  explicit MyOp(OpKernelConstruction* context) : OpKernel(context) {
    // ...
  }

  void Compute(OpKernelContext* ctx) override {
    Tensor tmp_var;
    Tensor* output = nullptr;
    TensorShape some_shape, some_shape2;
    // temporarily use this space
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for the output tensor
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
    // ...
  }
};
Whatever needs memory should be allocated by the TensorFlow context and not by custom cudaMalloc or new type[num] calls; the context provides the information for the Allocator (see below).
Consider, for the sake of simplicity, just adding two matrices (full example).
TensorFlow operations usually contain the following structure:
Op description via REGISTER_OP, which is responsible for shape checking and setting the output shape (example)
OpKernel, responsible for allocating memory, getting pointers to the inputs, and setup (see above or this)
Functor for the implementation itself, like:
Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, &tmp_var, output);
You are then left with implementing the functor itself:
// gpu version
template <typename Dtype>
struct MyFunctor<GPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};

// cpu version
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};
Edit:
allocate_persistent: use this if you need your data between Op invocations, like one-time index structures. [example]
allocate_temp: just temporary memory, which will not be retained past the end of the Compute method. [example]
But I highly recommend reading the comment in the source code here and then deciding depending on your use case.
The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device, it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
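A rough sketch of that pattern (TF 1.x-era API; the buffer shape, dtype, and names below are placeholders):
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

class MyPersistentOp : public OpKernel {
 public:
  explicit MyPersistentOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    // Allocate once; the buffer is placed on the device the kernel runs on
    // and outlives individual Compute() calls.
    Tensor* unused = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_persistent(DT_FLOAT, TensorShape({1024}),
                                                 &persistent_buf_, &unused));
  }

  void Compute(OpKernelContext* ctx) override {
    // Re-access the persistent buffer on every invocation.
    Tensor* buf = persistent_buf_.AccessTensor(ctx);
    // ... use buf->flat<float>().data() in the CUDA kernel ...
  }

 private:
  PersistentTensor persistent_buf_;
};
Because the PersistentTensor is a member of the kernel, the buffer lives as long as the kernel instance does.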

Designing an accumulating Tensorflow GPU operator

I'm designing a GPU op kernel that iteratively accumulates data in a buffer of GPU memory.
It's important that the data remains in GPU memory. So something along the lines of:
with tf.device('/gpu:0'):
    buffer = tf.zeros(...)
    buffer = accumulate(param11, param12, buffer)
    buffer = accumulate(param21, param22, buffer)
    buffer = accumulate(param31, param32, buffer)

with tf.device('/cpu:0'):
    A = do_some_more_stuff(buffer)
I'd like some input on three approaches that I think can be used to accomplish this:
1. Allocate an output tensor on each call and use that as an input tensor on the next call. This is simple to implement, but I'm concerned that continual allocation of GPU memory will be an issue. Will TensorFlow release the now-unused allocations back into the GPU memory pool?
REGISTER_OP("Accumulate")
.Input("param1: T")
.Input("param2: T")
.Input("buffer_in: T")
.Output("buffer_out: T")
void Compute(tensorflow::OpKernelContext * ctx) override
{
TensorShape output_shape{...};
Tensor * output_ptr = nullptr;
OP_REQUIRES_OK(ctx, ctx->allocate_output(
0, output_shape, &output_ptr))
kernel<<<grid, blocks, 0, stream>>>(
ctx->input(0), ctx->input(1),
output);
}
2. Reference input and output tensors and ensure they're referring to the same data. As I understand the standard ops and the OpKernelContext documentation, this needs to be protected with a mutex, as other ops may also be accessing the underlying referenced tensor...
REGISTER_OP("Accumulate")
.Input("param1: T")
.Input("param2: T")
.Input("buffer_in: Ref(T)")
.Output("buffer_out: Ref(T)")
void Compute(tensorflow::OpKernelContext * ctx) override
{
mutex_lock(mu_);
ctx->forward_ref_input_to_ref_output(2, 0);
kernel<<<grid, blocks, 0, stream>>>(
ctx->input(0), ctx->input(1),
ctx->mutable_input(2, true));
}
3. Use allocate_persistent() in conjunction with an OpKernelConstruction context
to provide a persistent buffer for accumulation. I'd prefer not to do this because
I'm dealing with variable buffer sizes and they'll probably be fairly large.
I'm not really sure what you're trying to do with your C++ code, but from looking at the python snippet I think tf.assign might help. It allows you to do things like this:
buffer = tf.Variable(...)
param = tf.Variable(...)
accumulate_op = buffer.assign(expr<param, buffer>)
...
sess.run(accumulate_op)
Running accumulate_op should update your buffer on the gpu (you may have to wrap it in a tf.group to avoid fetching the updated value).