Is TensorRT "floating-point 16" precision mode non-deterministic on Jetson TX2? - deterministic

I'm using TensorRT FP16 precision mode to optimize my deep learning model. And I use this optimised model on Jetson TX2. While testing the model, I have observed that TensorRT inference engine is not deterministic. In other words, my optimized model gives different FPS values between 40 and 120 FPS for same input images.
I started to think that the source of the non-determinism is floating point operations when I see this comment about CUDA:
"If your code uses floating-point atomics, results may differ from run
to run because floating-point operations are generally not
associative, and the order in which data enters a computation (e.g. a
sum) is non-deterministic when atomics are used."
Is type of precision such as FP16, FP32 and INT8 affects determinism of TensorRT? Or anything?
Do you have any thoughs?
Best regards.

I solved the problem by changing the function clock() that I used for measuring latencies. The clock() function was measuring the CPU time latency, but what I want to do is to measure real time latency. Now I am using std::chrono to measure the latencies. Now inference results are latency-deterministic.
That was wrong one, (clock())
int main ()
{
clock_t t;
int f;
t = clock();
inferenceEngine(); // Tahmin yapılıyor
t = clock() - t;
printf ("It took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
return 0;
}
Use Cuda Events like this, (CudaEvent)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
inferenceEngine(); // Do the inference
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
Use chrono like this: (std::chrono)
#include <iostream>
#include <chrono>
#include <ctime>
int main()
{
auto start = std::chrono::system_clock::now();
inferenceEngine(); // Do the inference
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time)
<< "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Related

GPU Array multiplications using Pycuda on Numpy arrays

I have tried to implement Element-wise multiplication of two numpy arrays by making similar GPU arrays and performing the operations. However, the resulting execution time is much slower than the original numpy pointwise multiplication. I was hoping to get a good speedup using the GPU. zz0 is complex128 type, (64,256,16) shape numpy array and xx0 is float64 type,(16,151) shape numpy array. Can someone please help me figure out what I am doing wrong with respect to the implementation:
import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg
linalg.init()
# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
zzg = gpuarray.to_gpu(zz_stretch)
xxg = gpuarray.to_gpu(xx_stretch)
zz_Hypg = linalg.multiply(zzg,xxg)
zz_Hyp = zz_Hypg.get()
return zz_Hyp
zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')
xx0_exp = np.exp(-1j*xx0)
t1 = time.time()
#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)
t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))
#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)
t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))
The first issue is that your timing metrics are not accurate.
Linalg compiles cuda modules on the fly, and you may see code being compiles as you run it. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw massive gains in performance ex:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
vs
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
However that is still much slower than the CPU version. The next thing I did was give a more accurate timing based upon where the actual computation is happening. You aren't tiling in your numpy version, so don't time it in your cuda version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU, and include that in the calculation, but that in itself takes a non trivial amount of time, so lets remove that:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
Wow, that contributed a lot. But we still are slower than the numpy version? Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broad casting. To get the real speed, you would have to:
not Tile
broadcast dimensions
implement this in a kernel.
Pycuda provides the utilities for kernel implementation, but its GPU array does not provide broadcasting. Essentially what you would have to do is this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32
//instruction level parallelism factor, how much extra work to do per thread, may be changed but effects the launch dimensions. thread group size should be (tile_factor, tile_factor/ilp_factor)
constexpr unsigned ilp_factor = 4
//assuming c order:
// x axis contiguous out,
// y axis contiguous in zz,
// x axis contiguous in xx
// using restrict because we know that all pointers will refer to different parts of memory.
__global__
void element_wise_multiplication(
pycuda::complex<double>* __restrict__ array_zz,
pycuda::complex<double>* __restrict__ array_xx,
pycuda::complex<double>* __restrict__ out_array,
unsigned array_zz_w, /*size of w,z,y, dimensions used in zz*/
unsigned array_zz_z,
unsigned array_zz_xx_y,/*size of y,x, dimensions used in xx, but both have same y*/
unsigned array_xx_x){
// z dimensions in blocks often have restrictions on size that can be fairly small, and sometimes can cause performance issues on older cards, we are going to derive x,y,z,w index from just the x and y indicies instead.
unsigned x_idx = blockIdx.x * (work_tile_dim) + threadIdx.x
unsigned y_idx = blockIdx.y * (work_tile_dim) + threadIdx.y
//blockIdx.z stores both z and w and should not over shoot, and aren't used
//shown for the sake of how to get these dimensions.
unsigned z_idx = blockIdx.z % array_zz_z;
unsigned w_idx = blockIdx.z / array_zz_z;
//we already know this part of the indexing calculation.
unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_z);
// since our input array is actually 3D, this is a different calcualation
unsigned array_zz_zw = blockIdx.z * (array_zz_xx_y)
//ensures if our launch dimensions don't exactly match our input size, we don't
//accidently access out of bound memory, while branching can be bad, this isn't
// because 99.999% of the time no branch will occur and our instruction pointer
//will be the same per warp, meaning virtually zero cost.
if(x_idx < array_xx_x){
//moving over y axis to coalesce memory accesses in the x dimension per warp.
for(int i = 0; i < ilp_factor; ++i){
//need to also check y, these checks are virtually cost-less
// because memory access dominates time in such simple calculations,
// and arithmetic will be hidden by overlapping execution
if((y_idx+i) < array_zz_xx_y){
//splitting up calculation for simplicity sake
out_array_idx = out_idx_zw+(y_idx+i)*array_xx_x + x_idx;
array_zz_idx = array_zz_zw + (y_idx+i);
array_xx_idx = ((y_idx+i) * array_xx_x) + x_idx;
//actual final output.
out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
}
}
}
}
You will have to make the launch dimensions something like:
thread_dim = (work_tile_dim, work_tile_dim/ilp_factor) # (32,8)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
block_dim = (x_dim/work_tile_dim, y_dim/work_tile_dim, wz_dim)
And there are several further optimizations you may be able to take advantage of:
store global memory accesses in work tile in shared memory inside of kernel, this ensures that accesses to zz0s "y", but really x dimension are coallesced when put into shared memory, increasing performance, then accessed from shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
instead of calculating eulers formula and expanding a double into a complex double, expand it inside of the kernel itself, use sincos(-x, &out_sin, &out_cos) to achieve the same result, but utilizing way less memory bandwidth (see here).
But note, even doing this will likely not give you the performance you want (though will still likely be faster) unless you are on a higher end GPU with full double precision units, which aren't on most GPUs (most of the time it is emulated). Double precision floating point units take up a lot of space, and since gpus are used for graphics, they don't have much use for double precision. If you want higher precision than floating point, but want to take advantage of floating point hardware with out a 1/8 to 1/32 throughput hit of double, you can use the techniques used in this answer to achieve this on the gpu, getting you closer to 1/2 to 1/3 throughput.

OpenCL 2.x - Sum Reduction function

From this previous post: strategy-for-doing-final-reduction, I would like to know the last functionalities offered by OpenCL 2.x (not 1.x which is the subject of this previous post above), especially about the atomic functions which allow to perform reductions of a array (in my case a sum reduction).
One told me that performances of OpenCL 1.x atomic functions (atom_add) were bad and I could check it, so I am looking for a way to get the best performances for a final reduction function (i.e the sum of each computed sum corresponding to each work-group).
I recall the typical kind of kernel code that I am using for the moment :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you tell me please how to perform a second and final reduction of this array, with the best performances possibles of functions availables with OpenCL 2.x . A classic solution is to perform this final reduction with CPU but ideally, I would like to do it directly with kernel code.
A suggestion of code snippet is welcome.
A last point, I am working on MacOS High Sierra 10.13.5 with the following model :
Can OpenCL 2.x be installed on my hardware MacOS model ?
Atomic functions should be avoided because they do harm performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do a single invocation for large amounts of data as there is no way to synchronize between work groups.
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocation together is through OpenCL events.

Running session using tensorflow c++ api is significantly slower than using python

I am trying to run SqueezeDet using tensorflow c++ api (CPU only). I have freezed tensorflow graph and loaded it from C++. While in terms of detection quality everything is fine, performance is much slower than in python. What can be the reason of that?
Simplified, my code looks like this:
int main (int argc, const char * argv[])
{
// Initializing graph
tensorflow::GraphDef graph_def;
// Folder in which graph data is located
string graph_file_name = "Model/graph.pb";
// Loading graph
tensorflow::Status graph_loaded_status = ReadBinaryProto(tensorflow::Env::Default(), graph_file_name, &graph_def);
if (!graph_loaded_status.ok())
{
cout << graph_loaded_status.ToString() << endl;
return 1;
}
unique_ptr<tensorflow::Session> session_sqdet(tensorflow::NewSession(tensorflow::SessionOptions()));
tensorflow::Status session_create_status = session_sqdet->Create(graph_def);
if (!session_create_status.ok())
{
cout << "Session create status: fail." << endl;
return 1;
}
while ()
{
/* create & preprocess batch */
session.Run({{ "image_input", input_tensor}, {"keep_prob", prob_tensor}}, {"probability/score", "bbox/trimming/bbox"}, {}, &final_output);
/* do some postprocessing */
}
}
What I have tried:
1) Using optimization flags - all are on, no warnings.
2) Using batching: performance increased, but the gap between python and C++ is still significant (running session takes 1s vs 2.4s with batch_size = 20).
Any help would be highly appreciated.
I've spent a lot of time on that problem (most of it because of stupid mistakes I made), but I finally solved it. Now I want to post here my experience as it might be useful.
So those are steps I'd advice to follow someone facing the same issue (some of them are quite obvious, though):
0) Do the profiling properly! Be sure you are using tools reliable in multicore/GPU/whatever setting you have.
1) Check that tensorflow and all related packages are built with all optimizations on.
2) Optimize the graph after freezing.
3) In case you are using different batch sizes during training and inference, make sure that you have removed all the dependencies in the model! Note that otherwise you won't have an error message or even worse performance in terms of results quality, you'll only have a mysterious slowdown!

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into let us say, two buckets.
The easiest way to do it is to do it appears to be
Load the variables into shared memory
Do vectorized loads for unsigned char, etc. if needed.
Do an atomic add in shared memory
Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
uint tid = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
block_reduced[threadIdx.x] = 0;
__syncthreads();
atomicAdd(&block_reduced[data[tid]],1);
__syncthreads();
for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_BINS)
atomicAdd(&count[i],block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention, and shared memory bank conflicts are given as reasons. I don't know if there is any way to remove either of these for this calculation in any significant way.
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.
__global__ void ballot_popc_reducer(int *data, int *count ){
uint tid = blockIdx.x*blockDim.x + threadIdx.x;
uint warp_id = threadIdx.x >> 5;
//need lane_ids since we are going warp level
uint lane_id = threadIdx.x%32;
//for ballot
uint warp_set_bits=0;
//to store warp level sum
__shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
//shared data
__shared__ uint s_data[NUM_THREADS_PER_BLOCK];
//load shared data - could store to registers
s_data[threadIdx.x] = data[tid];
__syncthreads();
//suspicious loop - I think we need more parallelism
for(int i=0; i<NUM_BINS; i++){
warp_set_bits = __ballot(s_data[threadIdx.x]==i);
if(lane_id==0){
warp_reduced_count[warp_id] = __popc(warp_set_bits);
}
__syncthreads();
//do warp level reduce
//could use shfl, but it does not change the overall picture
if(warp_id==0){
int t = threadIdx.x;
for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
if(t<j) warp_reduced_count[t] += warp_reduced_count[t+j];
__syncthreads();
}
}
__syncthreads();
if(threadIdx.x==0){
atomicAdd(&count[i],warp_reduced_count[0]);
}
}
}
This gives decent numbers (well, that is moot - peak device mem bw is 133 GB/s, things seem to depend on launch configuration) for the single bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason might perhaps be because of the single thread looping through all the bins, asking for parallelization of the NUM_BINS, loop.
I have tried several ways of going about parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction for bins gives similarly uninspiring results.
The other idea I tried just for kicks was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to no real compute work for the child kernels and the launch overhead.
The most promising approach seems to be - from Nicholas Wilt's article
on using these so-called privatized histograms containing bins for each thread in shared memory, which would ostensibly be very heavy on shmem usage (and we only have 48 kB per SM on Maxwell).
Perhaps someone could shed some insight into the problem? I feel that one ought to go change the algorithm instead so as not to use histograms, to use something less frequentist. Otherwise, I suppose we just use the atomics version.
Edit: The context for my problem is in computing probability density functions to be used for pattern-classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen Windows or Kernel Density Estimation. However, this does not overcome the problem of dimensionality as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
I faced similar chalanges to work with clustering, but in the botton end, the best solution was to use the scan pattern to group the processing. So, I don't think that it would work for you. Since you asked for some experience in this are, I'll share mine with you.
The issues
In your 1st code, I guess that the deal with the low performance with the number of bins reduction is linked to warp stall, since you do perform very little processing for every evaluated data. When the number of bins is increased, the relation between processing and global memory load (data info) for that kernel is also increased. You can check that very easily with the "Issue Efficiency" Experiments at the Performance Analysis from Nsight. Probably you are getting a low rate of cycles with at least one elegible warp (Warp Issue Efficiency).
Since I was not able to improve the number of elegible warps to somewhere close to 95%, I gave up this approach, since for some cases it gets worse (the memory dependency stall 90% of my processing cycles.
The shuffle and vote reduction is very usefull if the number of bins is not to large. If it is to large, a small amount of threads should be active for every bin filter. So you may end up with a lot of code divergence, and that is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have a good control flow, so the whole warp/block presents a similar processing, but a lot chance across blocks.
A feasible solution
I don't know where, but there are very good solutions for your problem around that I saw. Did you tried this one?
Also you can use a vectorized load and try something like that, but I'm not sure how much would it improve your performance:
__global__ hist(int4 *data, int *count, int N, int rem, unsigned int init) {
__shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dinamically
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
for (int i = 0; i < N; i+= warpSize) {
atomicAdd(&sBins[data[i + init].w], 1);
atomicAdd(&sBins[data[i + init].x], 1);
atomicAdd(&sBins[data[i + init].y], 1);
atomicAdd(&sBins[data[i + init].z], 1);
}
//process remaining elements if the data is not multiple of 4
// using recast and a additional control
for (int i = 0; i < rem; i++) {
atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
}
//update your histogram data here
}

cuda uncorrectable ECC error encountered

My environment is
Windows 7 x64
Matlab 2012a x64
Cuda SDK 4.2
Tesla C2050 GPU
I am having trouble figuring out why my GPU is crashing with the "uncorrectable ECC error encountered". This error only occurs when i use 512 threads or more. I can't post the kernel, but i will try to describe what it does.
In general, the kernel takes a number of parameters and produces 2 complex matricies defined by the thread size, M and another number, N. So the returned matrices will be of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the numbers are 256x256.
Each thread (kernel) extracts a 999 size vector out of a 2D array based on the thread id, ie size 999xM, then cycles through the row (0 .. N-1) of the output matrices for calculation. A number of intermediate parameters are calculated, only using pow, sin and cos among the + - * / operators. To calculate one of the output matrices an additional loop needs to be executed to sum up the contribution of the 999 vector that was extracted earlier. This loop does some intermediate calculations to determine a range of values that will allow contribution. The contribution is then scaled by a factor determined by the cos and sine values of a calculated fractional value. This is where it crashes. If i stick in a constant value or 1.0 or any other for that matter, the kernel executes without trouble. however, when only one of the calls (cos or sine) is included, the kernel crashes.
Some psuedocode follows:
kernel()
{
/* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
for (int i = 0; i < 999; i++)
{
.....
}
/* Cycle through the 2nd dimension of the output matricies */
for (int j = 0; j < N; j++)
{
/* Calculate some intermediate variables */
/* Calculate the real and imaginary components of the first output matrix */
/* real = cos(value), imaginary = sin(value) */
/* Construct the first output matrix from some intermediate variables and the real and imaginary components */
/* Calculate some more intermediate variables */
/* cycle through the extracted vector (0 .. 998) */
for (int k = 0; k < 999; k++)
{
/* Calculate some more intermediate variables */
/* Determine the range of allowed values to contribute to the second output matrix. */
/* Calculate the real and imaginary components of the second output matrix */
/* real = cos(value), imaginary = sin(value) */
/* This is were it crashes, unless real and imaginary are constant values (1.0) */
/* Sum up the contributions of the extracted vector to the second output matrix */
}
/* Construct the Second output matrix from some intermediate variables and the real and imaginary components */
}
}
I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case, I'm using less than the 32,768 registers with 512 threads. Can anyone give any suggestions as to what the cause of this could be?
Here is the ptasx info:
ptxas info : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __internal_trig_reduction_slowpathd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU device memory.
Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.
There are diagnostic utilities floating around to stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analog for testing the device RAM banks of the GPU.
If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, this confirms that the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.