MKL Sparse BLAS segfault when transposing CSR with 100M rows - blas

I am trying to use MKL Sparse BLAS for CSR matrices with the number of rows/columns on the order of 100M. My source code, which seems to work fine for 10M rows/columns, fails with a segfault when I increase that to 100M.
I isolated the problem to the following code snippet:
void TestSegfault1() {
  float values[1] = { 1.0f };
  int col_indx[1] = { 0 };
  int rows_start[1] = { 0 };
  int rows_end[1] = { 1 };

  // Step 1. Create 1 x 100M matrix
  // with single non-zero value at (0,0)
  sparse_matrix_t A;
  mkl_sparse_s_create_csr(
      &A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
      rows_start, rows_end, col_indx, values);

  // Step 2. Transpose it to get 100M x 1 matrix
  sparse_matrix_t B;
  mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);
}
This function segfaults in mkl_sparse_convert_csr with backtrace
#0 0x00000000004c0d03 in mkl_sparse_s_convert_csr_i4_avx ()
#1 0x0000000000434061 in TestSegfault1 ()
For slightly different code (but essentially the same) it has a little more detail:
#0 0x00000000008fc09b in mkl_serv_free ()
#1 0x000000000099949e in mkl_sparse_s_export_csr_data_i4_avx ()
#2 0x0000000000999ee4 in mkl_sparse_s_convert_csr_i4_avx ()
Apparently something goes wrong in memory allocation, and from the outside it certainly looks like some kind of integer overflow. The build of MKL I have uses MKL_INT = int = int32.
Is it indeed the case that the number of rows I can have in a Sparse BLAS CSR matrix is limited to fewer than 100M (it looks more like ~65M)? Or am I doing something wrong?
EDIT 1: MKL version string is "Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications".
EDIT 2: Figured it out. There is indeed a subtle integer overflow when MKL allocates memory for internal per-thread buffers. At some point inside mkl_sparse_s_export_csr_data_i4_avx it attempts to allocate (omp_get_max_threads() + 1) * num_rows * 4 bytes; that number does not fit in a 32-bit signed integer. The subsequent call to mkl_serv_malloc causes memory corruption and eventually the segfault. One possible workaround is to reduce the number of OpenMP threads via an omp_set_num_threads call.
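For completeness, here is a minimal sketch of that workaround applied to the repro above (untested; the header names are from a standard MKL/OpenMP setup and the thread count of 4 is just an example: with 100M rows, (num_threads + 1) * 100,000,000 * 4 has to stay below 2^31 - 1, so at most 4 threads fit):
#include <omp.h>
#include <mkl_spblas.h>

// Workaround sketch: cap the OpenMP thread count so that the internal
// (num_threads + 1) * num_rows * 4 byte allocation stays below INT32_MAX.
void TestWorkaround() {
  float values[1] = { 1.0f };
  int col_indx[1] = { 0 };
  int rows_start[1] = { 0 };
  int rows_end[1] = { 1 };

  omp_set_num_threads(4);  // 5 * 100,000,000 * 4 = 2.0e9 < 2^31 - 1

  sparse_matrix_t A, B;
  mkl_sparse_s_create_csr(&A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
                          rows_start, rows_end, col_indx, values);
  mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);

  mkl_sparse_destroy(A);
  mkl_sparse_destroy(B);
}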

Could you check your example with the latest version of MKL? I ran it on MKL 11.3.2 and it passed correctly for the 100M matrix. However, it could depend on the number of threads on your machine (the size of the matrix multiplied by the number of threads has to be less than max int). To prevent such issues I strongly recommend using the ILP64 version of the MKL libraries.
Thanks,
Alex

Check how this example works with the latest MKL 2019 u4,
compiling the example in ILP64 mode as follows:
icc -I/opt/intel/compilers_and_libraries_2019/linux/mkl/include test_csr.cpp \
-L/opt/intel/compilers_and_libraries_2019/linux/mkl/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
./a.out
mkl_sparse_convert_csr passed

Related

GPU Array multiplications using Pycuda on Numpy arrays

I have tried to implement element-wise multiplication of two numpy arrays by making corresponding GPU arrays and performing the operations. However, the resulting execution time is much slower than the original numpy pointwise multiplication. I was hoping to get a good speedup using the GPU. zz0 is a complex128, (64,256,16)-shaped numpy array and xx0 is a float64, (16,151)-shaped numpy array. Can someone please help me figure out what I am doing wrong with respect to the implementation:
import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg
linalg.init()
# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz, xx):
    zz_stretch = np.tile(zz, (1, 1, 1, xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0], zz.shape[1], 1, 1))
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg, xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp
zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')
xx0_exp = np.exp(-1j*xx0)
t1 = time.time()
#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)
t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))
#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)
t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))
The first issue is that your timing metrics are not accurate.
Linalg compiles CUDA modules on the fly, and you may see code being compiled as you run it. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw massive gains in performance, e.g.:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
vs
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
However, that is still much slower than the CPU version. The next thing I did was take a more accurate timing based on where the actual computation happens. You aren't tiling in your numpy version, so don't time the tiling in your CUDA version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU and include that in the calculation, but that in itself takes a non-trivial amount of time, so let's remove it:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
Wow, that contributed a lot. But we are still slower than the numpy version. Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting. To get the real speed, you would have to:
not Tile
broadcast dimensions
implement this in a kernel.
Pycuda provides the utilities for kernel implementation, but its GPU array does not provide broadcasting. Essentially what you would have to do is this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor: how much extra work to do per thread.
//It may be changed, but it affects the launch dimensions; the thread group
//size should be (work_tile_dim, work_tile_dim / ilp_factor).
constexpr unsigned ilp_factor = 4;
//assuming c order:
// x axis contiguous in out,
// y axis contiguous in zz,
// x axis contiguous in xx
//using __restrict__ because we know that all pointers refer to different parts of memory.
__global__
void element_wise_multiplication(
    pycuda::complex<double>* __restrict__ array_zz,
    pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w,    /*size of the w dimension of zz (unused below)*/
    unsigned array_zz_z,    /*size of the z dimension of zz*/
    unsigned array_zz_xx_y, /*size of the y dimension, shared by zz and xx*/
    unsigned array_xx_x){   /*size of the x dimension of xx*/
    //z dimensions in blocks often have restrictions on size that can be fairly small,
    //and sometimes can cause performance issues on older cards, so we derive the
    //z and w indices from blockIdx.z instead of using a higher-dimensional launch.
    unsigned x_idx = blockIdx.x * work_tile_dim + threadIdx.x;
    //each thread handles ilp_factor consecutive y values, so step threadIdx.y by ilp_factor
    unsigned y_idx = blockIdx.y * work_tile_dim + threadIdx.y * ilp_factor;
    //blockIdx.z stores both z and w and should not overshoot; z_idx and w_idx aren't
    //used below and are shown only for the sake of how to get these dimensions.
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    (void)z_idx; (void)w_idx;
    //we already know this part of the indexing calculation.
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_x);
    //since the zz input array is actually 3D, this is a different calculation
    unsigned array_zz_zw = blockIdx.z * array_zz_xx_y;
    //ensures that if our launch dimensions don't exactly match our input size, we don't
    //accidentally access out-of-bounds memory. While branching can be bad, this isn't,
    //because 99.999% of the time no branch will occur and our instruction pointer
    //will be the same per warp, meaning virtually zero cost.
    if(x_idx < array_xx_x){
        //moving over the y axis to coalesce memory accesses in the x dimension per warp.
        for(unsigned i = 0; i < ilp_factor; ++i){
            //need to also check y; these checks are virtually cost-less
            //because memory access dominates time in such simple calculations,
            //and the arithmetic will be hidden by overlapping execution
            if((y_idx + i) < array_zz_xx_y){
                //splitting up the calculation for simplicity's sake
                unsigned out_array_idx = out_idx_zw + (y_idx + i) * array_xx_x + x_idx;
                unsigned array_zz_idx  = array_zz_zw + (y_idx + i);
                unsigned array_xx_idx  = (y_idx + i) * array_xx_x + x_idx;
                //actual final output.
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}
You will have to make the launch dimensions something like:
thread_dim = (work_tile_dim, work_tile_dim // ilp_factor)  # (32, 8)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# round up so partial tiles at the edges are still covered
block_dim = ((x_dim + work_tile_dim - 1) // work_tile_dim,
             (y_dim + work_tile_dim - 1) // work_tile_dim,
             wz_dim)
And there are several further optimizations you may be able to take advantage of:
store the global memory accesses of the work tile in shared memory inside the kernel; this ensures that accesses to zz0's "y" (but really x) dimension are coalesced when put into shared memory, increasing performance, and are then read from shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
instead of calculating Euler's formula on the host and expanding a double into a complex double, expand it inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result while using far less memory bandwidth (see here). A rough sketch follows this list.
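As an illustration of the second point, a hypothetical device-side helper (untested, reusing the pycuda-complex.hpp type from the kernel above) that builds exp(-i*x) from the real-valued phase inside the kernel, instead of shipping a precomputed complex128 array from the host:
// Hypothetical sketch: compute exp(-i*x) on the device with sincos
// instead of precomputing np.exp(-1j*xx0) on the host.
__device__ pycuda::complex<double> neg_complex_exp(double x) {
    double s, c;
    sincos(-x, &s, &c);                    // s = sin(-x), c = cos(-x)
    return pycuda::complex<double>(c, s);  // cos(-x) + i*sin(-x) == exp(-i*x)
}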
But note, even doing this will likely not give you the performance you want (though it will still likely be faster) unless you are on a higher-end GPU with full double-precision units, which most GPUs do not have (most of the time double precision is emulated). Double-precision floating-point units take up a lot of die space, and since GPUs are used for graphics, they don't have much use for double precision. If you want higher precision than single-precision floating point, but want to take advantage of the floating-point hardware without the 1/8 to 1/32 throughput hit of doubles, you can use the techniques described in this answer to achieve that on the GPU, getting you closer to 1/2 to 1/3 throughput.

OpenCL 2.x - Sum Reduction function

From this previous post: strategy-for-doing-final-reduction, I would like to know the latest functionality offered by OpenCL 2.x (not 1.x, which is the subject of the post above), especially the atomic functions which allow performing reductions of an array (in my case a sum reduction).
I was told that the performance of the OpenCL 1.x atomic functions (atom_add) was poor, and I was able to confirm it, so I am looking for the way to get the best performance for the final reduction (i.e. the sum of the computed sums, one per work-group).
Here is the typical kernel code that I am using at the moment:
__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Copy from global memory to local memory
    localSums[local_id] = input[get_global_id(0)];

    // Loop for computing localSums
    for (uint stride = group_size/2; stride > 0; stride /= 2)
    {
        // Waiting for each 2x2 addition into given workgroup
        barrier(CLK_LOCAL_MEM_FENCE);

        // Divide WorkGroup into 2 parts and add elements 2 by 2
        // between local_id and local_id + stride
        if (local_id < stride)
            localSums[local_id] += localSums[local_id + stride];
    }

    // Write result into partialSums[nWorkGroups]
    if (local_id == 0)
        partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you please tell me how to perform a second and final reduction of this array, with the best possible performance, using the functions available in OpenCL 2.x? A classic solution is to perform this final reduction on the CPU, but ideally I would like to do it directly in kernel code.
A code snippet suggestion is welcome.
One last point: I am working on macOS High Sierra 10.13.5 with the following model:
Can OpenCL 2.x be installed on this macOS hardware?
Atomic functions should be avoided because they hurt performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do it in a single invocation for large amounts of data, as there is no way to synchronize between work groups.
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocation together is through OpenCL events.
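For illustration, here is a hedged host-side sketch of that repeated-invocation pattern, chaining the passes with events (hypothetical names for the queue, kernel and buffers; no error checking; it assumes each pass's element count is padded to a multiple of the local size, or that the kernel guards out-of-range reads):
/* Keep launching the same reduction kernel until a single value remains,
   ping-ponging between two buffers and chaining the passes with events. */
size_t local_size = 256;          /* tune as discussed above              */
size_t n = num_groups_first_pass; /* elements left after the first launch */
cl_mem in_buf  = partialSumsBuf;  /* partial sums from the first launch   */
cl_mem out_buf = scratchBuf;      /* same size as partialSumsBuf          */
cl_event prev = NULL;

while (n > 1) {
    size_t groups      = (n + local_size - 1) / local_size;
    size_t global_size = groups * local_size;

    clSetKernelArg(sumKernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(sumKernel, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(sumKernel, 2, local_size * sizeof(cl_double), NULL); /* __local scratch */

    cl_event done;
    clEnqueueNDRangeKernel(queue, sumKernel, 1, NULL, &global_size, &local_size,
                           prev ? 1 : 0, prev ? &prev : NULL, &done);
    prev = done;

    n = groups;                                            /* one partial sum per group */
    cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp;  /* ping-pong the buffers     */
}
/* in_buf now holds the final sum in element 0; read it back with clEnqueueReadBuffer. */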

Fastest Cython implementation depends on computer?

I am converting a Python script to Cython and optimizing it for more speed. Right now I have two versions: on my desktop V2 is twice as fast as V1; unfortunately, on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4, 64-bit, 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2, 64-bit, 8GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx and using import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2D numpy array board to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function is called from go.py; only the numpy array board is used.
V2 - you can find the full code here. Quite some stuff has changed; below you can find a list with everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
the 2d numpy array is replaced by:
cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces numpy list comprehension:
cdef char get_board_feature( self, short location ):
    # return correct board feature value
    #   0 active player stone
    #   1 opponent stone
    #   2 empty location
    cdef char value = self.board[ location ]
    if value == EMPTY:
        return 2
    if value == self.player_current:
        return 0
    return 1
The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """
    cdef short location

    for location in range( 0, state.size * state.size ):
        tensor[ offSet + state.get_board_feature( location ), location ] = 1

    return offSet + 3
Please let me know if I should include any other information or run certain tests.
cmp, diff test
The V2 go.c and preprocessing.c files are identical.
V1 does not generate a .c file to compare.
Update: compared the .so files.
The V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
The preprocessing.so files are identical; not sure what to think of that...
They are two different machines and they behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but execution times of two small but non-trivial pieces of code do not 'have' to favor the desktop CPU, and differences in execution times definitely do not have to follow any linear relationship. Performance always depends on a huge number of factors. Possible explanations include, but are not limited to, the smaller L1 and L2 caches on the desktop and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Trying to stare at benchmarks between different machines will probably only cause a headache.

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr without casting into doubles, since I really only need 32-bit floats. And I ditched the conj() since my data are always real.
cov = A.dot(A.T)/n  # where A is an array of 32 bit floats
diag = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(diag, diag))
I still run out of memory, and I'm using a large-memory machine with 264GB.
I've been told that the fast C libraries are probably using a routine which breaks the dot product up into pieces, and that to optimize this the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in a reasonable amount of time "manually" with Python loops.
Does anybody know of a way to ask numpy for a decent dot product routine that balances memory usage with speed?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
The array size is actually 151,000 x 265.
I'm running out of memory on a machine with 264 GB with at least 230 GB free.
I'm surprised that the numpy call to BLAS dgemm, together with being careful about C-order arrays, didn't do squat.
Python compiled with Intel's MKL will run this with 12GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Python Anaconda and install the accelerate package, which has a 30-day trial version (free for academics) and contains an MKL build. Various other C++ BLAS libraries should work as well; even if one copies the array from C to F order it should not take more than ~30GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
If it's a gemm issue you can try a chunked gemm formula:
def chunk_gemm(A, B, csize):
    # compute A.dot(B) in csize x csize blocks to limit peak memory use
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in xrange(0, A.shape[0], csize):
        iend = i + csize
        for j in xrange(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS-optimized routines, which sometimes require copies of arrays to be in Fortran order; I'm not sure if that's the case here. einsum will buffer its inputs and use vectorized SIMD operations where possible, but outside of that it is basically going to run the naive three nested loops to compute the matrix product.
UPDATE: It turns out the dot product completed without error, but upon careful inspection the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] is 0 for all rows...
This is super annoying...
Another BLAS description
The exchange mentions something that I thought about too... since BLAS is Fortran, shouldn't the order of the input be F order? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault, so I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using arrays of np.float32 which were stored in F order. I can't control the F order, to my knowledge, since the data is loaded from images using an imaging library.
import scipy
roi = np.ascontiguousarray( roi )# see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and the 's' is for 'single'.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed... I went back and read the Wikipedia article on BLAS routines to understand level 3 vs. the other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing:
A.dot(A.T)
Thanks everyone for your thoughts...knowing that it could be done was most important.

cuda uncorrectable ECC error encountered

My environment is
Windows 7 x64
Matlab 2012a x64
Cuda SDK 4.2
Tesla C2050 GPU
I am having trouble figuring out why my GPU is crashing with the "uncorrectable ECC error encountered". This error only occurs when I use 512 threads or more. I can't post the kernel, but I will try to describe what it does.
In general, the kernel takes a number of parameters and produces 2 complex matrices defined by the thread size M and another number, N. So the returned matrices will be of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the numbers are 256x256.
Each thread (kernel) extracts a 999-element vector out of a 2D array based on the thread id, i.e. size 999xM, then cycles through the rows (0 .. N-1) of the output matrices for the calculation. A number of intermediate parameters are calculated, using only pow, sin and cos among the + - * / operators. To calculate one of the output matrices, an additional loop needs to be executed to sum up the contribution of the 999-element vector that was extracted earlier. This loop does some intermediate calculations to determine the range of values that are allowed to contribute. The contribution is then scaled by a factor determined by the cos and sin values of a calculated fractional value. This is where it crashes. If I stick in a constant value of 1.0 or any other for that matter, the kernel executes without trouble. However, when just one of the calls (cos or sin) is included, the kernel crashes.
Some pseudocode follows:
kernel()
{
    /* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
    for (int i = 0; i < 999; i++)
    {
        .....
    }

    /* Cycle through the 2nd dimension of the output matrices */
    for (int j = 0; j < N; j++)
    {
        /* Calculate some intermediate variables */

        /* Calculate the real and imaginary components of the first output matrix */
        /* real = cos(value), imaginary = sin(value) */

        /* Construct the first output matrix from some intermediate variables
           and the real and imaginary components */

        /* Calculate some more intermediate variables */

        /* Cycle through the extracted vector (0 .. 998) */
        for (int k = 0; k < 999; k++)
        {
            /* Calculate some more intermediate variables */

            /* Determine the range of allowed values to contribute to the second output matrix. */

            /* Calculate the real and imaginary components of the second output matrix */
            /* real = cos(value), imaginary = sin(value) */
            /* This is where it crashes, unless real and imaginary are constant values (1.0) */

            /* Sum up the contributions of the extracted vector to the second output matrix */
        }

        /* Construct the second output matrix from some intermediate variables
           and the real and imaginary components */
    }
}
I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case; I'm using fewer than the 32,768 registers available with 512 threads. Can anyone give any suggestions as to what the cause of this could be?
Here is the ptxas info:
ptxas info : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __internal_trig_reduction_slowpathd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU device memory.
Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.
There are diagnostic utilities floating around to stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analog for testing the device RAM banks of the GPU.
If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, this confirms that the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.