MATLAB calling CUDA GPU code without kernel functions

As the title indicates, I have C-MEX code, which I can successfully call from MATLAB. Now I simply change the suffix from *.cpp to *.cu and compile with the command
mexcuda updateElectricFields.cu
Compiling and running succeed and yield the correct results. My question is: is this *.cu file executed on the GPU? Is the GPU trying to access CPU memory, causing slower performance than on the CPU?
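Renaming *.cpp to *.cu does not by itself move any work to the GPU: mexcuda routes the file through nvcc, but plain C/C++ code in it is still compiled as ordinary host code and runs on the CPU, and no GPU-to-CPU memory traffic is involved. For contrast, here is a minimal hypothetical sketch of a MEX file that does execute on the GPU; it needs a __global__ kernel, an explicit launch, and explicit device memory management (input validation and error checking omitted):

#include "mex.h"

// Hypothetical kernel: scales a copy of the input array on the GPU.
__global__ void scale(double *d, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int n = (int)mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxDuplicateArray(prhs[0]);
    double *h = mxGetPr(plhs[0]);

    double *d;
    cudaMalloc(&d, n * sizeof(double));                         // device memory
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);                 // runs on the GPU
    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

Without the __global__ kernel and the <<<...>>> launch, nothing in the file touches the GPU.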

Related

cupy jit kernel vs numba cuda

(I'm sharing the attachments as a Google Drive link at "https://drive.google.com/file/d/1yzXfu5ZdY5ByxfYwTQLrQWgzTZrsww8u/view?usp=sharing" because they are big.)
I'm comparing the results of the attached cupy implementation, "test_cupy_jit_kernel.py" (using jit.rawkernel), to a numba cuda implementation, "test_numba_cuda.py" (using a cuda.jit kernel). The functions in both files are almost the same (except for using cupy.absolute instead of the plain Python abs function). There is a huge difference in results, as shown in the attached result files ("init_paw_cupy_jit.npy" and "init_paw_numba.npy"), and I don't know why.
So I tried a simple cupy implementation (without kernels) in the attached "test_cupy_simple.py" file. Its results (the attached "init_paw_numba.npy") are very close to the numba code's, but still different from the cupy jit kernel implementation's. And although this simple cupy implementation works, it takes longer than both the numba and cupy jit kernel versions.
I don't know how to fix the different results.
BTW, I also tried a fused kernel, but I received different and strange errors, so I gave up on that approach.
Regards

How to build a C file and run it on the integrated GPU of my computer?

I have a single-file C program that is heavy with floating-point calculations. How should I build it with specific gcc flags and run it on the integrated GPU of my computer? I just want to compare whether the program runs faster on the GPU than on the CPU. (Actually, I am not sure whether this question makes sense in the first place.)
I am using a computer specified as follows:
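The hardware specification is omitted above, so the following is a hedged sketch only. One route for offloading plain C loops with gcc is OpenMP target offloading, which requires a gcc build configured with an offload backend (upstream gcc supports NVIDIA PTX and AMD GCN targets; typical Intel integrated GPUs are not covered and would need a different toolchain such as OpenCL or SYCL). The file name and loop below are made up for illustration:

#include <stdio.h>

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload the hot loop to the GPU if an offload target is available;
       otherwise OpenMP falls back to running it on the host. */
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 3.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}

Build with something like gcc -O2 -fopenmp -foffload=nvptx-none saxpy.c -o saxpy (the -foffload target must match how your gcc was built). Timing the loop with and without the pragma gives the CPU-vs-GPU comparison the question asks about.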

Would a Vulkan program run on a device without gpu (discrete or integrated)?

Perhaps this question could be rephrased as: what would happen if I tried to run a Vulkan program on a CPU-only build?
I'm wondering whether the program would run but produce no output, crash, or fail to build in the first place (although I expect the build process to target a CPU architecture rather than a GPU architecture).
Would it use the on-motherboard graphics to produce output? In that case, what would happen if the program were run on a CPU-only server?
It depends on how the program initializes Vulkan.
Any build can have the Vulkan loader installed; this is the dynamically loaded library that finds the actual driver. If the loader is missing, the program may either fail to start or show an error message, depending on how it tries to load it.
If no device is available, then the enumerated device count is 0. This is again up to the application to manage, either by falling back to an alternative graphics API (such as OpenGL) or by showing an error message and exiting.
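As a rough illustration of the two failure points in this answer (missing loader versus zero devices), here is a minimal sketch using the standard Vulkan C API, with error handling abbreviated:

#include <stdio.h>
#include <string.h>
#include <vulkan/vulkan.h>

int main(void) {
    /* If the loader itself (vulkan-1.dll / libvulkan.so) is absent, a
       directly linked program fails to start before main() runs; a program
       that loads it dynamically can detect the failure and show a message. */
    VkInstanceCreateInfo info;
    memset(&info, 0, sizeof(info));
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;

    VkInstance instance;
    if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS) {
        fprintf(stderr, "no usable Vulkan driver found\n");
        return 1;
    }

    /* With a loader but no device, enumeration simply reports a count of 0;
       falling back (e.g. to OpenGL) or exiting cleanly is the app's job. */
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    printf("physical devices: %u\n", count);

    vkDestroyInstance(instance, NULL);
    return 0;
}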

CUDA profiler shows strange gaps?

I am trying to figure out what a profiling result means before I start to optimize. I am very new to CUDA and to profiling in general, and I am confused by the result.
Specifically, I want to know what is happening during the seemingly unoccupied chunks of computation. Looking from top to bottom across the CPU and GPU timelines, there appear to be large portions of the run where nothing is happening: columns with nothing in Thread1 and nothing in GeForce. Is this normal? What's happening here?
The run was done on a multicore machine under no load with nvprof. The GPU code was compiled with -arch=sm_20 -m32 -g -G for CUDA 5.
The mistake here was profiling the code in debug mode (the -G compiler flag: "Generate debug information for device code"). Debug mode changes the behavior of the program deeply, so it should not be used when profiling and optimizing code.
One other thing: thorough documentation of nvcc's debug mode is hard to find. nvcc probably spills registers/shared memory to global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. the discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, tools such as cuda-memcheck --tool racecheck should be used on the release build too.
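As a concrete illustration, the commands below echo the question's setup minus -g/-G; -lineinfo is an assumed extra (a common nvcc option for correlating profiles with source lines), not something from the original question:

nvcc -arch=sm_20 -m32 -O2 -lineinfo app.cu -o app
nvprof ./app                           # profile the release build
cuda-memcheck --tool racecheck ./app   # race checking, also on the release build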

RAR password recovery on GPU using ATI Stream processor

I'm a newbie in GPU programming. I am working on brute-force RAR password recovery on an ATI Stream processor using the Brook+ language, but I see that a kernel written in Brook+ cannot call normal functions (only other kernel functions). My questions are:
1) How can I use the unrar.dll API (to extract archive files) in this situation? And is this the only way to program RAR password recovery?
2) What about crack and ElcomSoft software that use the GPU? How do they work?
3) What exactly is the role of the function that runs on the GPU (ATI Stream processor or CUDA) in such a program?
4) Is NVIDIA/CUDA technology easier or more flexible than the ATI/Brook+ language?
1) unrar.dll is a compiled dynamic-link library. DLLs execute on the CPU. GPUs have vastly different machine code and a very different execution model, so they can't run DLLs.
You could try to implement a callback from the GPU to the CPU via events, or build an x86 interpreter on the GPU, but these would almost certainly run slower than just running on the CPU.
Using unrar.dll is not the only way to program RAR password recovery. You could instead just build your own code for CPU and GPU from scratch.
2) They work by having the CPU code explicitly request that some GPU code run on the GPU.
3) I don't know exactly. I would guess, though, that such a program has a GPU kernel that tries many different combinations and benefits from running them in parallel (see the sketch after this answer).
4) CUDA is more mature than Brook+. Brook+ may be just as easy for simple tasks, but it isn't as fully featured. For new projects, most people would now choose OpenCL over Brook+.
(I'm not sure what you're intending to do, but none of the above seems likely to enable anything sinister.)
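To make answers 2) and 3) concrete, here is a minimal CUDA sketch of that pattern: the CPU launches a kernel, and each GPU thread tests one candidate in parallel. check_candidate is a hypothetical placeholder; a real tool would reimplement the RAR key-derivation/decryption test in device code, since the GPU cannot call unrar.dll:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder for the real password check.
__device__ bool check_candidate(unsigned long long idx) {
    return idx == 123456ULL;  // pretend this candidate index is correct
}

__global__ void brute_force(unsigned long long start, unsigned long long *found) {
    unsigned long long idx = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (check_candidate(idx))
        *found = idx;  // each thread tests one candidate
}

int main() {
    unsigned long long *found;
    cudaMallocManaged(&found, sizeof(*found));  // memory visible to CPU and GPU
    *found = ~0ULL;                             // sentinel: nothing found yet
    brute_force<<<1024, 256>>>(0, found);       // CPU requests GPU execution (answer 2)
    cudaDeviceSynchronize();
    if (*found != ~0ULL)
        printf("matching candidate index: %llu\n", *found);
    cudaFree(found);
    return 0;
}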