I have implemented a parallel BF Generator in Python like in this Post!
Parallelize brute force generation.
I want to implement this parallel technique on a GPU. Should be like a parallel BF Generator on a GPU.
Can someone help me out with some code examples for a parallel BF Generator on a GPU?
Couldn't find any examples online which made me suspicious.
Look at this implementation - i did the distribution on the GPU with this code:
void IncBruteGPU( unsigned char* theBrute, unsigned int charSetLen, unsigned int bruteLength, unsigned int incNr){
unsigned int i = 0;
while(incrementBy > 0 && i < bruteLength){
int add = incrementBy + ourBrute[i];
ourBrute[i] = add % charSetLen;
incrementBy = add / charSetLen;
i++;
}
}
call it like this:
// the Thread index number
int idx = get_global_id(0);
// the length of your charset "abcdefghi......"
unsigned int charSetLen = 26;
// the length of the word you want to brute
unsigned int bruteLength = 6;
// theBrute keeps the single start numbers of the alphabeth
unsigned char theBrute[MAX_BRUTE_LENGTH];
IncrementBruteGPU(theBrute, charSetLen, bruteLength, idx);
Good Luck!
Related
I compiled a kernel in NVRTC:
__global__ void kernel_A(/* args */) {
unsigned short idx = threadIdx.x;
unsigned char warp_id = idx / 32;
unsigned char lane_id = idx % 32;
/* ... */
}
I know integer division and modulo are very costly on CUDA GPUs. However I thought this kind of division-by-power-of-2 should be optimized into bit operations, until I found it isn't:
__global__ void kernel_B(/* args */) {
unsigned short idx = threadIdx.x;
unsigned char warp_id = idx >> 5;
unsigned char lane_id = idx & 31;
/* ... */
}
it seems kernel_B just runs faster. When omitting all other codes in kernel, launching with 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2us in average, while kernel_B runs 7.4us in average. I speculate NVRTC did not optimize out the integer division and modulo.
The result is obtained on a GeForce 750 Ti, CUDA 8.0, averaged from 100 calls. The compiler options given to nvrtcCompileProgram() is -arch compute_50.
Is this expected?
Did a thorough bugsweep in the codebase. Turns out my app was built in DEBUG mode. This causes additional flags -G and -lineinfo passed to nvrtcCompileProgram()
From nvcc man page:
--device-debug (-G)
Generate debug information for device code. Turns off all optimizations.
Don't use for profiling; use -lineinfo instead.
I have the following code which progressively goes through a string of bits and rearrange them into blocks of 20bytes. I'm using 32*8 blocks with 40 threads per block. However the process takes something like 36ms on my GT630M. Are there any further optimization I can do? Especially with regard to removing the if-else in the inner most loop.
__global__ void test(unsigned char *data)
{
__shared__ unsigned char dataBlock[20];
__shared__ int count;
count = 0;
unsigned char temp = 0x00;
for(count=0; count<(streamSize/8); count++)
{
for(int i=0; i<8; i++)
{
if(blockIdx.y >= i)
temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))>>(blockIdx.y - i);
else
temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))<<(i - blockIdx.y);
}
dataBlock[threadIdx.x] = temp;
//do something
}
}
It's not clear what your code is trying to accomplish, but a couple obvious opportunities are:
1) if possible, use 32-bit words instead of unsigned char.
2) use block sizes that are multiples of 32.
3) The conditional code may not be costing you as much as you expect. You can check by compiling with --cubin --gpu-architecture sm_xx (where xx is the SM version of your target hardware), and using cuobjdump --dump-sass on the resulting cubin file to look at the generated assembly. You may have to modify the source code to loft the common subexpression into a separate variable, and/or use the ternary operator ? : to hint to the compiler to use predication.
saturating instructions saturate unsigned to unsigned or signed to signed int.
What's the best way to saturate signed 16-bit ints to unsigned byte?
In short, here's the logic
uint8_t usat8(uint8_t u8, int16_t s16)
{
s16 += u8;
if(s16 <= 0) {
return 0;
} else if(s16 >=255){
return 255;
}else{
return (uint8_t)s16;
}
}
void add_row(uint8_t * dst, uint8_t * u8, int16_t * s16)
{
for(int i=0; i<XXX; ++i)
{
dst[i] = usat8(u8[i] + s16[i]);
}
}
values of s16 are usually not much off from the [0, 255] range, e.g. it's safe to assume that abs(s16[x]) < 1000.
EDIT: I just realized that USAT16 actually saturates signed 16-bit int to unsigned integer. Simple USAT16 is the solution to the problem.
After 5 mins of thinking I have this idea (pseudo arm-asm):
sadd16 sum, s16, u8 # do two additions in parallel
orr signs, 0x1001, sum, lsr #15 # extract signs of the two 16 bit results
usat16 sum, sum, #8 # saturate both of the 16-bit sums to unsigned 8-byte range
uadd16 sum, sum, signs
this way, if sign bit was set for any of the sums the resulting sum will become 256, or 0x100. When writing back the data the shifted out 0x1 will be discarded.
Any comments, does that seem like the optimal approach, is there any better alternative?
PS. I do it for an armv6 device, no NEON or armv6t2
I have a 2D matrix in the main. I want to transfer if from host to device. Can you tell me how I can allocate memory for it and transfer it to the device memory?
#define N 5
__global__ void kernel(int a[N][N]){
}
int main(void){
int a[N][N];
cudaMalloc(?);
cudaMemcpy(?);
kernel<<<N,N>>>(?);
}
Perhaps something like this is what you really had in mind:
#define N 5
__global__ void kernel(int *a)
{
// Thread indexing within Grid - note these are
// in column major order.
int tidx = threadIdx.x + blockIdx.x * blockDim.x;
int tidy = threadIdx.y + blockIdx.y * blockDim.y;
// a_ij = a[i][j], where a is in row major order
int a_ij = a[tidy + tidx*N];
}
int main(void)
{
int a[N][N], *a_device;
const size_t a_size = sizeof(int) * size_t(N*N);
cudaMalloc((void **)&a_device, a_size);
cudaMemcpy(a_device, a, a_size, cudaMemcpyHostToDevice);
kernel<<<N,N>>>(a_device);
}
The point you might have missed is that when you statically declare an array like this A[N][N], it is really just a row major ordered piece of linear memory. The compiler is automatically converting between a[i][j] and a[j + i*N] when it emits code. On the GPU, you must use the second form of access to read the memory you copy from the host.
My monte carlo pi calculation CUDA program is causing my nvidia driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function.Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int incircle, k;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
free(sumHost);
cudaFree(sumDev);
cudaFree(devStates);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
EDIT: According to the CUDA 4.0 curand docs(page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
x = curand_uniform(&state);
y = curand_uniform(&state);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs, there are several useful guidelines.
For those of you having a geforce GPU which does not support TCC driver there is another solution based on:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
start regedit,
navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
create new DWORD key called TdrLevel, set value to 0,
restart PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.