gpu::BFMatcher_GPU and BFMatcher give different results

I am using OpenCV 2.4.10 with CUDA 7.0 on VS2010.
In my CPU project I find keypoints and match them like this:
detector = new cv::SURF(150,3);
descriptorExtractor = cv::DescriptorExtractor::create("SURF");
detector->detect(gry0,keypoints0);
descriptorExtractor->compute(gry0, keypoints0, descriptor0);
detector->detect(gry1,keypoints1);
descriptorExtractor->compute(gry1, keypoints1, descriptor1);
cv::BFMatcher matcher(cv::NORM_L2);
matcher.match(descriptor1, descriptor0, matches);
On the GPU:
cv::gpu::SURF_GPU surf(150,3);
surf(gpumatFrameGray0, cv::gpu::GpuMat(), keypoints0GPU, descriptors0GPU);
surf(gpumatFrameGray1, cv::gpu::GpuMat(), keypoints1GPU, descriptors1GPU);
surf.downloadKeypoints(keypoints0GPU, keypoints0);
surf.downloadKeypoints(keypoints1GPU, keypoints1);
cv::gpu::BFMatcher_GPU matcher(cv::NORM_L2);
matcher.matchSingle(descriptors1GPU, descriptors0GPU, trainIdx, distance);
matcher.matchDownload(trainIdx, distance, matches);
I have 2 questions.
1) Most of the keypoint locations are the same for CPU and GPU, but some values differ by about 0.000002. Is this normal, and why does it happen?
2) My second, and more important, question is that the CPU and GPU matching results differ. I compiled a table of the matches.
The table shows the positions of the keypoints in the two images that were matched by the CPU and by the GPU code.
For example, a keypoint with x position "22.333189" in frame 0 matches two keypoints in frame 1 with the CPU code.
But with the GPU code, it matches three different keypoints.
There are many differences like this. Because of these differences, the resulting homography also differs and the algorithm gives a different result. What is the solution to this problem?
Thank You

As said in http://answers.opencv.org/question/10745/bfmatcher-implemented-differently-on-gpu/:
Floating-point arithmetic is slightly different on the CPU and the GPU, and furthermore it can differ on the same hardware when different libraries (such as IPP or NPP) are used.
Last but not least, the GPU's SURF descriptors differ from the CPU's, so the matches will differ too.
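One common mitigation (not something the linked answer prescribes, just a standard trick) is to make the matching step less sensitive to tiny descriptor differences, for example by keeping only unambiguous matches via Lowe's ratio test before estimating the homography. A minimal sketch against the OpenCV 2.4 CPU API, assuming descriptor0/descriptor1 are the cv::Mat descriptors computed above:

#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Keep only matches whose best distance is clearly smaller than the second
// best; these are the matches least likely to flip between CPU and GPU.
std::vector<cv::DMatch> ratioFilteredMatches(const cv::Mat& query,
                                             const cv::Mat& train,
                                             float ratio = 0.8f)
{
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(query, train, knn, 2);   // two nearest neighbours per query row
    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knn.size(); ++i)
        if (knn[i].size() == 2 && knn[i][0].distance < ratio * knn[i][1].distance)
            good.push_back(knn[i][0]);
    return good;
}

The GPU side has knnMatchSingle/knnMatchDownload on BFMatcher_GPU for the same purpose; filtering both pipelines this way should bring the CPU and GPU homographies much closer together.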

Related

Optimal 2D FFT sizes on NVIDIA GPUs

We are benchmarking 2D FFT performance on an NVIDIA A100 in order to determine which sizes have the best performance. The following shows how the runtime for each size is measured; GPU memory is cleared after each size is run.
from cupyx.profiler import benchmark
from cupyx.scipy.fft import fft2  # fft2 variant that accepts overwrite_x

def run_fft():
    fft2(array, axes=(-2, -1), overwrite_x=True)  # array: the cupy.ndarray under test

timing = benchmark(run_fft, n_repeat=10, n_warmup=1)
Running it across a range of possible sizes produces the measurements plotted below.
As you can see, there seems to be a set of sizes that are slower than the rest (the quasi-linear streaks above the main sequence), and these include sizes whose factors are small primes (such as 2 and 3). I was wondering whether there is a general rule that defines which 2D FFT sizes run optimally (for the CPU case with fftw3, for example, such a rule is documented).
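For cuFFT, the usual rule of thumb (stated in the cuFFT documentation) is that transforms whose dimensions factor entirely into the primes 2, 3, 5 and 7 take the highly optimized code paths, while other sizes may fall back to slower algorithms; as the plot suggests, this alone does not explain every slow size, so benchmarking is still needed. A small illustrative helper (my own sketch, not part of the question) to flag the "smooth" dimensions when picking benchmark candidates:

#include <cstdio>

// True if n factors entirely into 2, 3, 5 and 7 -- the sizes for which
// cuFFT advertises its fastest code paths.
static bool isSevenSmooth(int n)
{
    const int primes[4] = {2, 3, 5, 7};
    for (int i = 0; i < 4; ++i)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}

int main()
{
    for (int n = 4090; n <= 4100; ++n)
        std::printf("%4d %s\n", n, isSevenSmooth(n) ? "smooth (fast path expected)"
                                                    : "not smooth (may be slower)");
    return 0;
}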

Performance drop in matrix multiplication for certain sizes on AMD Polaris

I have OpenCL code that multiplies two matrices (GEMM) with M=4096, N=4096 and K=16 (i.e. 4096 x 16 input matrices of floats).
I run it on a Polaris 560, a 16-CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed very strange performance drops for this size: the multiplication reaches only ~8-10 GFlops, while if I change N to 4095 or 4097 I get around 130-150 GFlops. I noticed similar behaviour with other GEMM libraries like clBLAS or MIOpenGEMM: there is a significant performance drop for this particular 4096x16 size, and changing N by 1 boosts the performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (an 8x8 block per thread).
I tried changing matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
I tested the same code with ROCm 3.7 OpenCL, Clover OpenCL and even with the Windows OpenCL driver - same behaviour.
There is no such issue on an NVIDIA GTX 960, which has the same number of GPU cores (threads) and the same memory type/size.
I suspect that this is somehow cache/collision related, but I don't understand how it happens, so I don't know how to work around it.
Finally, I found that the clBLAS library (originally developed for AMD) handles the special case of lda % 1024 == 0, ldb % 1024 == 0, probably because of the cache:
https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
I found that a better way was to rearrange the blocks in z-curve (Morton) order instead of queuing several kernels:
https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
To handle the cases M != N or M != 1<<n, I just increased the number of work-groups along M/N to the nearest 1<<n; groups that have no work exit at the beginning without adding much overhead.
The z-order rearrangement improved performance about 4x.
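For readers wondering what the z-curve rearrangement looks like in practice: instead of walking the output tiles row by row, the linear group id is decoded into bit-interleaved (Morton) coordinates, so consecutive work-groups touch neighbouring tiles of C and keep reusing the same rows of A and columns of B. A hedged sketch of that decode step (not the author's kernel, which is linked above):

// Decode a linear group id into z-order (Morton) tile coordinates for a
// square, power-of-two grid of tiles: even bits become the column index,
// odd bits become the row index.
static void mortonDecode(unsigned id, unsigned &tileRow, unsigned &tileCol)
{
    tileRow = 0;
    tileCol = 0;
    for (unsigned bit = 0; bit < 16; ++bit) {
        tileCol |= ((id >> (2 * bit))     & 1u) << bit;
        tileRow |= ((id >> (2 * bit + 1)) & 1u) << bit;
    }
}

As noted above, when M or N is not a power of two the grid is simply rounded up to the next 1<<n and the out-of-range groups return immediately.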

In ml5 pitch detection using the CREPE model, how to detect pitch above ~2 kHz

I'm successfully using pitch detection features of ml5:
tutorial: https://ml5js.org/reference/api-PitchDetection/
model: https://cdn.jsdelivr.net/gh/ml5js/ml5-data-and-models/models/pitch-detection/crepe/
The issue:
No pitch above ~2000 Hz is detected. I tried multiple devices and checked that the sounds are visible on sonograms, so it does not seem to be a mic issue.
I assumed it might be a result of sampling-rate limitations or resampling done by the library, as the Nyquist frequency (the maximum "recordable" frequency) is half of the sampling rate.
I hosted the ml5 sources locally and tried modifying the PitchDetection class.
There I see that the audio seems to be resampled to 1024 Hz for performance reasons. This does not sound right though: if I'm not mistaken, it would only allow detection of frequencies up to 512 Hz. I am definitely missing something (or a lot).
I tried fiddling with the rates, but increasing it to, say, 2048 causes an error:
Error when checking : expected crepe_input to have shape [null,1024] but got array with shape [1,2048].
My question is:
Is there something in the ml5 PitchDetection class I can modify or configure (perhaps a different model) to detect frequencies higher than 2000 Hz using the CREPE model?
After more investigation, it turns out the CREPE model itself only supports pitches up to ~1997 Hz (seen in the code) or 1975.5 Hz (seen in the paper).
The paper about CREPE:
https://arxiv.org/abs/1802.06182
It states:
"The 360 pitch values are denoted as c_1, c_2, ..., c_360 and are selected so that they cover six octaves with 20-cent intervals between C1 and B7, corresponding to 32.70 Hz and 1975.5 Hz."
The JS implementation has this mapping which maps the 360 intervals to 0 - 1997Hz range:
const cent_mapping = tf.add(tf.linspace(0, 7180, 360), tf.tensor(1997.3794084376191))
This means that, short of retraining the model, I'm probably out of luck using it for now.
Edit:
After a good night's sleep I found a simple solution which works for my simple application.
In essence, I resample my audio buffer so that its pitch is two times lower. CREPE then detects a pitch of 440 Hz as 220 Hz, and I just need to multiply the result by 2.
The result is still more consistently correct than the YIN algorithm for my real-time, noisy application.
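The octave-down trick itself is independent of the language; the sketch below is only an illustration (in C++, while the ml5 app is JavaScript) of one way to realize it: stretch the buffer to twice its length by interpolation, so every frequency appears an octave lower to the model, then double whatever pitch it reports.

#include <vector>
#include <cstddef>

// Stretch the buffer to twice its length by linear interpolation. Interpreted
// at the same sample rate, every frequency then appears one octave lower, so
// e.g. a 3 kHz tone lands inside CREPE's ~32 Hz - ~1975 Hz range.
std::vector<float> octaveDown(const std::vector<float>& in)
{
    std::vector<float> out(in.size() * 2);
    for (std::size_t m = 0; m + 1 < out.size(); ++m) {
        const std::size_t i = m / 2;
        out[m] = (m % 2 == 0) ? in[i] : 0.5f * (in[i] + in[i + 1]);
    }
    if (!out.empty()) out.back() = in.back();
    return out;
}

// The model sees half the true pitch, so the reported value is doubled.
inline float truePitchHz(float detectedHz) { return 2.0f * detectedHz; }

In the ml5 setup one would stretch a 512-sample window into the 1024 samples that crepe_input expects, which keeps the [null,1024] shape requirement satisfied.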

Is there any good reason to transpose a tensor from NHWC to NCHW?

I often see transpose implementations in TensorFlow code. I wonder why one would want to transpose an NHWC tensor to NCHW. Please give me a good example and the reason behind it.
Rather than citing the documentation, you should look into how CUDA works and think about how to implement most operations.
The reason NCHW is generally faster than NHWC is how the CUDA kernels are written. In CUDA you need to specify what each thread does, like:
const int threads = 32;
dim3 block(threads, threads);
dim3 grid(up2(W / 2, threads), up2(H, threads), B);  // up2(x, y): round-up division, i.e. the number of blocks needed
kernel<Dtype><<<grid, block>>>(args...);
Here you get three indices, threadIdx.z, threadIdx.y, threadIdx.x, and these threads are organized in warps (a hardware design).
And you want coalesced memory transactions, which means the threads are ordered in such a way that the GPU can access memory in a fast way.
To sum it up:
You want "threadIdx.x" to be the innermost loop, and you should organize the data layout so that reading it is coalesced. The ideal data structure should be accessible by
b * C * H * W + c * H * W + h * W + w
where the lowercase letters denote the indices and the capital letters denote the shape (e.g., 0 <= w < W).
In convolution operations (part of the most used layer), what you are essentially doing is cropping a region in each channel and computing a dot product with a region from another channel (of another tensor). So the indices which need to run fastest are the height index and the width index. In the end, you add along the channel axis (as the convolution formula suggests). This also explains why it makes no difference whether you consider NWHC or NCWH.
This has an impact on how you order the data. And it is the reason you want to have the memory layout I described above.
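To make that concrete, here is a small sketch (mine, not from the answer) of the flat offsets for the two layouts; with NCHW a warp whose threads differ only in w reads consecutive addresses, while with NHWC those same threads are C elements apart.

// Flat offsets for a tensor of shape (B, C, H, W) stored in either layout.
// Lowercase letters are indices, capitals are the extents, as above.
inline int offsetNCHW(int b, int c, int h, int w, int C, int H, int W)
{
    return ((b * C + c) * H + h) * W + w;   // w has stride 1: coalesced loads
}

inline int offsetNHWC(int b, int c, int h, int w, int C, int H, int W)
{
    return ((b * H + h) * W + w) * C + c;   // w has stride C: strided loads
}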
The worst layout would be:
H, C, B in threadIdx.z, threadIdx.y, threadIdx.x
The best layout would be:
B, C, H in threadIdx.z, threadIdx.y, threadIdx.x
The same is (mostly) true for GEMM as well (there one matrix needs to be transposed). The cuDNN source is not available, but you might be interested in looking into CUTLASS.
From the TensorFlow performance guide:
NHWC is the TensorFlow default and NCHW is the optimal format to use
when training on NVIDIA GPUs using cuDNN. [...] The brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. In the long term, we are working on tools to auto rewrite graphs to make switching between the formats transparent and take advantages of micro optimizations where a GPU Op may be faster using NHWC than the normally most efficient NCHW.
Essentially, cuDNN is optimized for NCHW, while CPU-only tensorflow is optimized for NHWC. Switching from one to the other is just a matter of performance maximization and/or unavailability of certain operations in a specific data format.

How could I read information off the graphics card after processing?

Supposing, for example, that I had a 10x10 "cloth" mesh, with each square being two triangles. Now, if I wanted to animate this, I could do the spring calculations on the CPU. Each vertex would have its own "spring" data and would, hopefully, bounce like whatever type of "cloth" it was supposed to represent.
However, that would involve a minimum of about 380? spring calculations per frame. Happily, the per-vertex calculations are "embarrassingly parallel": had I one CPU per vertex, each vertex could run on its own CPU. GPUs therefore are, in theory, an excellent choice for running such calculations.
Except (and this is using DirectX/SlimDX) I have no idea, or am not sure, how I would/should:
1) Send all this vertex data to the graphics card (yes, I know how to render stuff and have even written my own per-pixel and texture-blending global lighting effect file; however, each vertex needs to be able to access the position data of at least three other vertices). I suppose I could stick the relevant vertex positions and the number of vertex positions in TextureCoords, but there may be a different, standard solution.
2) Read all the vertex data back afterwards, so I can update the mesh in memory. Otherwise, each update will act on the exact same data with the exact same result, rather like computing 2 + 3 = 5 over and over when what you want is a running sequence such as 2 + 3 = 5, 5 - 2 = 3, 3 + 1.5 = 4.5.
And it may be that I'm looking in the wrong direction to do this.
Thanks.
You could use the approach you have described to pack the data into textures and write special HLSL shaders to compute spring forces and then update vertex positions. That approach is totally valid but can be troublesome when you try to debug problems because you are using texture pixels in an unconventional way (you can draw the texture and maybe write some code to watch the values in a given pixel). It is probably easier in the long run to use something like CUDA, DirectCompute, or OpenCL. CUDA allows you to "bind" the DirectX vertex-buffer for access in CUDA. Then in a CUDA kernel you can use the positions to calculate forces and then write new positions to the vertex-buffer (in parallel on the GPU) before you render the updated positions.
There is a cloth demo that uses DirectCompute among the DirectX 10/11 samples in the DirectX SDK.
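As a rough illustration of the per-vertex work such a kernel would do (a hedged sketch, not the SDK demo's code): each GPU thread owns one vertex, accumulates Hooke spring forces from its neighbours, and integrates, reading the previous positions and writing into separate output buffers so the "same input, same result" problem from point 2 cannot occur.

#include <cmath>

struct Vec3 { float x, y, z; };

// One vertex per GPU thread: accumulate spring forces from the neighbouring
// vertices, integrate velocity and position, and write into separate output
// arrays so no thread ever reads a half-updated neighbour.
void updateVertex(int i,
                  const Vec3* oldPos, const Vec3* oldVel,
                  Vec3* newPos, Vec3* newVel,
                  const int* neighbours, int neighbourCount,
                  float restLength, float k, float damping, float dt)
{
    Vec3 force = { 0.0f, -9.8f, 0.0f };                     // gravity, unit mass
    for (int n = 0; n < neighbourCount; ++n) {
        const Vec3& q = oldPos[neighbours[n]];
        Vec3 d = { q.x - oldPos[i].x, q.y - oldPos[i].y, q.z - oldPos[i].z };
        float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
        float s = k * (len - restLength) / (len > 1e-6f ? len : 1.0f);  // Hooke's law
        force.x += s * d.x;  force.y += s * d.y;  force.z += s * d.z;
    }
    newVel[i].x = (oldVel[i].x + force.x * dt) * damping;   // explicit Euler step
    newVel[i].y = (oldVel[i].y + force.y * dt) * damping;
    newVel[i].z = (oldVel[i].z + force.z * dt) * damping;
    newPos[i].x = oldPos[i].x + newVel[i].x * dt;
    newPos[i].y = oldPos[i].y + newVel[i].y * dt;
    newPos[i].z = oldPos[i].z + newVel[i].z * dt;
}

With CUDA-Direct3D interop, as mentioned above, the new positions can be written into the mapped vertex buffer directly, so nothing needs to be read back to the CPU just to render.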