OpenMP offloading on GPU: 'simd' specificities

I was wondering how to interpret the following OpenMP constructs:
#pragma omp target teams distribute parallel for
for(int i = 0; i < N; ++i) {
// compute
}
#pragma omp target teams distribute parallel for simd
for(int i = 0; i < N; ++i) {
// compute
}
Note the simd clause added on the second loop. According to the OpenMP 5.1 specification, this clause declares that "multiple iterations of the loop can be executed concurrently by using SIMD instructions".
I believe I have a good picture of how simd is implemented and behaves on a CPU, but on a GPU, more precisely on AMD GPUs, there is no exposed SIMD instruction as such: a HIP thread is, in effect, already a SIMD instruction lane.
According to the OpenMP specification, if there is a loop-carried dependency, or if the compiler cannot prove there is none, then when OpenMP maps the teams to thread blocks/workgroups and the threads to SIMD lanes, it is forced to use thread blocks of only one thread.
How do you interpret the target teams distribute parallel for simd:
Does it mean that in this context simd can't be translated for a GPU?
Or is it that each thread is handled as if it had a single SIMD lane?
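To make the two readings concrete, here is the mental model I have in mind, written as plain nested loops (my own sketch of one conceivable lowering, not something the specification prescribes):
// Sketch only: "teams distribute" -> workgroups, "parallel for" -> threads/work-items,
// "simd" -> lanes within a thread. On AMD GPUs a thread already is a SIMD lane
// of the wavefront, hence the question.
void mental_model(int N, int num_teams, int threads_per_team, int simd_width)
{
    for (int team = 0; team < num_teams; ++team)                    // teams distribute
        for (int thread = 0; thread < threads_per_team; ++thread)   // parallel for
            for (int lane = 0; lane < simd_width; ++lane) {         // simd
                int i = (team * threads_per_team + thread) * simd_width + lane;
                if (i < N) {
                    // compute
                }
            }
}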
There is at least one similar but old and unanswered question:
How is omp simd for loop executed on GPUs?

In the test case below, the assembly generated for an AMD MI250 (gfx90a) is the same with or without simd. On the CPU code, however, you will see a significant change with the simd clause, which in this case enables an optimization similar to the one obtained with an explicit restrict qualifier.
TL;DR: currently, the simd clause is irrelevant on the GPU and only leads to this warning, even for extremely trivial cases:
loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning].
#include <cstdint>

#define RESTRICT __restrict

using Float = double;

void test0_0(Float* a, const Float* b) {
    a[0] = b[0] * b[0];
    // Forced store/reload (b[0] could be a[0]).
    a[1] = b[0];
}

void test0_1(Float* RESTRICT a, const Float* RESTRICT b) {
    a[0] = b[0] * b[0];
    // No forced store/reload.
    a[1] = b[0];
}

void test1_0(Float* a, Float* b, std::size_t length) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // Forced store/reload.
        a[i + 1] = b[i + 0];
    }
}

void test1_1(Float* a, Float* b, std::size_t length) {
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // simd -> no loop carried dependencies:
        // No forced store/reload -> easier vectorization, less generated code.
        a[i + 1] = b[i + 0];
    }
}
void test2_0(Float* a, Float* b, std::size_t length) {
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, as expected.
        a[i + 1] = b[i + 0];
    }
}

void test2_1(Float* RESTRICT a, Float* RESTRICT b, std::size_t length) {
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload even though a/b are RESTRICT: BAD!
        a[i + 1] = b[i + 0];
    }
}
void test3_0(Float* a, const Float* b, std::size_t length) {
    #pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload despite simd: BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_1(Float* RESTRICT a, const Float* RESTRICT b, std::size_t length) {
    #pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload despite simd and RESTRICT: BAD!
        a[i + 1] = b[i + 0];
    }
}
Code available at: https://godbolt.org/z/sMY48s8jz
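Not part of the test case above, but one aliasing-agnostic workaround (my own sketch, not something the compiler does for you) is to read b[i] into a local once, so that the second store no longer forces a reload, regardless of what the compiler can prove about a and b:
#include <stddef.h>

// Sketch only: hypothetical variant of the tests above. Loading b[i] into a
// local means the store to a[i + 0] cannot invalidate it, so no reload is
// needed even without restrict or simd.
void test_workaround(double* a, const double* b, size_t length) {
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < length; i += 2) {
        const double bi = b[i];  // single load, kept in a register
        a[i + 0] = bi * bi;
        a[i + 1] = bi;
    }
}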

Related

How to optimize histogram statistics with neon intrinsics?

I want to optimize histogram statistics code with NEON intrinsics, but I didn't succeed. Here is the C code:
#define NUM (7*1024*1024)
uint8 src_data[NUM];
uint32 histogram_result[256] = {0};
for (int i = 0; i < NUM; i++)
{
    histogram_result[src_data[i]]++;
}
Histogram statistics are more like serial processing. It's difficult to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.
You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms).
The first thing you'll want to do, which doesn't actually require NEON to benefit, is to unroll the histogram array so that you can have more data in flight at once:
#define NUM (7*1024*1024)
uint8 src_data[NUM];
uint32 histogram_result[256][4] = {{0}};
uint32 packed_result[256];   // final, combined histogram

for (int i = 0; i < NUM; i += 4)
{
    uint32_t *p0 = &histogram_result[src_data[i + 0]][0];
    uint32_t *p1 = &histogram_result[src_data[i + 1]][1];
    uint32_t *p2 = &histogram_result[src_data[i + 2]][2];
    uint32_t *p3 = &histogram_result[src_data[i + 3]][3];

    uint32_t c0 = *p0;
    uint32_t c1 = *p1;
    uint32_t c2 = *p2;
    uint32_t c3 = *p3;

    *p0 = c0 + 1;
    *p1 = c1 + 1;
    *p2 = c2 + 1;
    *p3 = c3 + 1;
}

for (int i = 0; i < 256; i++)
{
    packed_result[i] = histogram_result[i][0]
                     + histogram_result[i][1]
                     + histogram_result[i][2]
                     + histogram_result[i][3];
}
Note that p0 to p3 can never point to the same address, so reordering their reads and writes is just fine.
From that you can vectorise the calculation of p0 to p3 with intrinsics, and you can vectorise the finalisation loop.
Test it as-is first (because I didn't!). Then you can experiment with structuring the array as result[4][256] instead of result[256][4], or using a smaller or larger unroll factor.
Applying some NEON intrinsics to this:
uint32 histogram_result[256 * 4] = {0};
static const uint16_t offsets[] = { 0x000, 0x001, 0x002, 0x003,
                                    0x000, 0x001, 0x002, 0x003 };
uint16x8_t voffs = vld1q_u16(offsets);
for (int i = 0; i < NUM; i += 8) {
    uint8x8_t p = vld1_u8(&src_data[i]);
    uint16x8_t p16 = vshll_n_u8(p, 2);   // widen and multiply by 4 (the unroll factor)
    p16 = vaddq_u16(p16, voffs);
    uint32_t c0 = histogram_result[vgetq_lane_u16(p16, 0)];
    uint32_t c1 = histogram_result[vgetq_lane_u16(p16, 1)];
    uint32_t c2 = histogram_result[vgetq_lane_u16(p16, 2)];
    uint32_t c3 = histogram_result[vgetq_lane_u16(p16, 3)];
    histogram_result[vgetq_lane_u16(p16, 0)] = c0 + 1;
    c0 = histogram_result[vgetq_lane_u16(p16, 4)];
    histogram_result[vgetq_lane_u16(p16, 1)] = c1 + 1;
    c1 = histogram_result[vgetq_lane_u16(p16, 5)];
    histogram_result[vgetq_lane_u16(p16, 2)] = c2 + 1;
    c2 = histogram_result[vgetq_lane_u16(p16, 6)];
    histogram_result[vgetq_lane_u16(p16, 3)] = c3 + 1;
    c3 = histogram_result[vgetq_lane_u16(p16, 7)];
    histogram_result[vgetq_lane_u16(p16, 4)] = c0 + 1;
    histogram_result[vgetq_lane_u16(p16, 5)] = c1 + 1;
    histogram_result[vgetq_lane_u16(p16, 6)] = c2 + 1;
    histogram_result[vgetq_lane_u16(p16, 7)] = c3 + 1;
}
With the histogram array unrolled x8 rather than x4 you might want to use eight scalar accumulators instead of four, but remember that this implies eight count registers and eight address registers, which is more registers than 32-bit ARM has (since you can't use SP and PC).
Unfortunately, with the address calculation in the hands of NEON intrinsics, I think the compiler can't safely reason about how it might be able to re-order reads and writes, so you have to reorder them explicitly and hope that you're doing it the best possible way.
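The finalisation loop can also be vectorised. As a rough sketch of my own (untested, assuming the flat histogram_result[256 * 4] layout above, a separate packed_result[256] output array, and a hypothetical pack_histogram() helper name):
#include <arm_neon.h>
#include <stdint.h>

// Sum the four per-slot counters of each bin. vld4q_u32 loads 16 consecutive
// uint32 values and de-interleaves them, so rows.val[k] holds counter k of
// bins i..i+3; adding the four vectors gives the packed totals for those bins.
void pack_histogram(uint32_t packed_result[256], const uint32_t histogram_result[256 * 4])
{
    for (int i = 0; i < 256; i += 4) {
        uint32x4x4_t rows = vld4q_u32(&histogram_result[i * 4]);
        uint32x4_t sum = vaddq_u32(vaddq_u32(rows.val[0], rows.val[1]),
                                   vaddq_u32(rows.val[2], rows.val[3]));
        vst1q_u32(&packed_result[i], sum);
    }
}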

iOS: CRC in Objective-C

I am new to iOS. I need to create a data packet using a CRC algorithm for the commands below:
int comm[6];
comm[0]=0x01;
comm[1]=6;
comm[2]=0x70;
comm[3]=0x00;
comm[4]=0xFFFF;
comm[5]=0xFFFF;
I have Java code that does the same thing, developed for Android:
byte[] getCRC(byte[] bytes)
{
    byte[] result = new byte[2];
    try
    {
        short crc = (short) 0xFFFF;
        for (int j = 0; j < bytes.length; j++)
        {
            byte c = bytes[j];
            for (int i = 7; i >= 0; i--)
            {
                boolean c15 = ((crc >> 15 & 1) == 1);
                boolean bit = ((c >> (7 - i) & 1) == 1);
                crc <<= 1;
                if (c15 ^ bit)
                {
                    crc ^= 0x1021; // 0001 0000 0010 0001 (0, 5, 12)
                }
            }
        }
        int crc2 = crc - 0xffff0000;
        result[0] = (byte) (crc2 % 256);
        result[1] = (byte) (crc2 / 256);
        return result;
    }
    catch (Exception ex)
    {
        result = null;
        return result;
    }
}
Input for getCRC() method: The data packet for which CRC is to be calculated.
Output of getCRC() method: CRC for the packet.
I need to do the same thing in Objective-C. Please help, and point me to any sample code if available.
Objective-C also incorporates C, so the contents of your method will look almost the same as in Java. All that is needed is to pass your data into and out of the method, in this example using NSData:
- (NSData *)bytesCRCResult:(NSData *)dataBytes
{
    unsigned char *result = (unsigned char *)malloc(2);
    unsigned char *bytes = (unsigned char *)[dataBytes bytes]; // returns readonly pointer to the byte stream
    uint16_t crc = (short) 0xFFFF;
    for (int j = 0; j < dataBytes.length; j++)
    {
        unsigned char c = bytes[j];
        for (int i = 7; i >= 0; i--)
        {
            bool c15 = ((crc >> 15 & 1) == 1);
            bool bit = ((c >> (7 - i) & 1) == 1);
            crc <<= 1;
            if (c15 ^ bit)
            {
                crc ^= 0x1021; // 0001 0000 0010 0001 (0, 5, 12)
            }
        }
    }
    uint16_t crc2 = crc - 0xffff0000;
    result[0] = (unsigned char) (crc2 % 256);
    result[1] = (unsigned char) (crc2 / 256);
    NSData *resultsToData = [NSData dataWithBytes:result length:2];
    free(result);
    return resultsToData;
}
NSData can be read as raw bytes using the [NSData bytes] method call, and has a range of useful properties and methods.
For the boolean value, you have a few options:
"bool" seems to be the ISO C/C++ standard type
"Boolean" is defined as "typedef unsigned char"
"boolean_t" is defined as "typedef unsigned int" or "typedef int", depending on 64-bit compilation apparently
"BOOL", the Objective-C bool, which is defined as "typedef signed char", according to http://nshipster.com/bool/ and might therefore not behave as expected.
"uint8_t" can be substituted for "unsigned char", for clarity.
Please note: The above code compiles without warning or complaint, but wasn't tested with actual data.
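If you want to sanity-check the bit-by-bit loop outside of Objective-C, here is the same CRC (initial value 0xFFFF, polynomial 0x1021) as a small standalone C function. This is just my transcription of the code above, with the same caveat that it has not been verified against real device data:
#include <stdint.h>
#include <stddef.h>

// Bit-by-bit CRC-16 with polynomial 0x1021 and initial value 0xFFFF,
// mirroring the Java/Objective-C loops above.
static uint16_t crc16_0x1021(const uint8_t *bytes, size_t length)
{
    uint16_t crc = 0xFFFF;
    for (size_t j = 0; j < length; j++)
    {
        uint8_t c = bytes[j];
        for (int i = 7; i >= 0; i--)
        {
            int c15 = (crc >> 15) & 1;
            int bit = (c >> (7 - i)) & 1;
            crc = (uint16_t)(crc << 1);
            if (c15 ^ bit)
                crc ^= 0x1021;
        }
    }
    return crc; // the methods above emit the low byte first, then the high byte
}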

The subscripted value is neither an array nor a pointer

So I am trying to allocate memory for a 2D array of ints such that I can reference it outside of the loop in which the size is determined. (I have a scope issue because the size of the array isn't fixed.)
So this was the proposed solution, but I am getting the error "The subscripted value is neither an array nor a pointer". Does anyone know what I am doing wrong?
//M and m are just 2 int numbers
int X = self.create2dArray(M,m);
for (int kk = 0; kk < M; kk++)
{
    for (int kk1 = 0; kk1 < m; kk1++)
    {
        //small "x" is an NSMutableArray of NSNumbers. So I am just running the 2 for loops to fill the whole 2D array
        X[kk][kk1] = [[x objectAtIndex: (kk + kk1 * J)] intValue]; //ERROR Line
    }
}

//outside of Main
static inline int **create2dArray(int w, int h)
{
    size_t size = sizeof(int) * 2 + w * sizeof(int *);
    int **arr = malloc(size);
    int *sizes = (int *) arr;
    sizes[0] = w;
    sizes[1] = h;
    arr = (int **) (sizes + 2);
    for (int i = 0; i < w; i++)
    {
        arr[i] = calloc(h, sizeof(**arr));
    }
    return arr;
}
I believe that first line should start with int **X instead of int X.
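That is, assuming create2dArray() is the plain C helper shown above, the corrected call site would presumably look something like this (a sketch, with M, m and the fill loop as in the question):
int **X = create2dArray(M, m);      // note: int **, not int
for (int kk = 0; kk < M; kk++)
{
    for (int kk1 = 0; kk1 < m; kk1++)
    {
        X[kk][kk1] = 0;             // fill from your NSMutableArray here
    }
}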
Okay, your problem most likely lies in how you are manually allocating the memory for the integers. My suggestion is to just fill the array with placeholder values, for instance 0, when you create it. That way you don't risk messing up the allocation process; it's much simpler, and it works since you'll be filling it with real integers later on. Hope this helps!

De-interleave and interleave buffer with vDSP_ctoz() and vDSP_ztoz()?

How do I de-interleave float *newAudio into float *channel1 and float *channel2, and then interleave them back into newAudio?
Novocaine *audioManager = [Novocaine audioManager];
__block float *channel1;
__block float *channel2;
[audioManager setInputBlock:^(float *newAudio, UInt32 numSamples, UInt32 numChannels) {
    // Audio comes in interleaved, so,
    // if numChannels = 2, newAudio[0] is channel 1, newAudio[1] is channel 2, newAudio[2] is channel 1, etc.
    // Deinterleave with vDSP_ctoz()/vDSP_ztoz() and fill channel1 and channel2
    // ... processing on channel1 & channel2
    // Interleave channel1 and channel2 with vDSP_ctoz()/vDSP_ztoz() back into newAudio
}];
What would these two lines of code look like? I don't understand the syntax of ctoz/ztoz.
What I do in Novocaine's accessory classes, like the Ringbuffer, for de-interleaving:
float zero = 0.0;
vDSP_vsadd(data, numChannels, &zero, leftSampleData, 1, numFrames);
vDSP_vsadd(data+1, numChannels, &zero, rightSampleData, 1, numFrames);
for interleaving:
float zero = 0.0;
vDSP_vsadd(leftSampleData, 1, &zero, data, numChannels, numFrames);
vDSP_vsadd(rightSampleData, 1, &zero, data+1, numChannels, numFrames);
The more general way to do things is to have an array of arrays, like
int maxNumChannels = 2;
int maxNumFrames = 1024;
float **arrays = (float **)calloc(maxNumChannels, sizeof(float *));
for (int i = 0; i < maxNumChannels; ++i) {
    arrays[i] = (float *)calloc(maxNumFrames, sizeof(float));
}
[[Novocaine audioManager] setInputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels) {
    float zero = 0.0;
    for (int iChannel = 0; iChannel < numChannels; ++iChannel) {
        // offset by iChannel to pick that channel out of the interleaved buffer
        vDSP_vsadd(data + iChannel, numChannels, &zero, arrays[iChannel], 1, numFrames);
    }
}];
which is what I use internally a lot in the RingBuffer accessory classes for Novocaine. I timed the speed of vDSP_vsadd versus memcpy, and (very, very surprisingly), there's no speed difference.
Of course, you can always just use a ring buffer and save yourself the hassle:
#import "RingBuffer.h"
int maxNumFrames = 4096;
int maxNumChannels = 2;
RingBuffer *ringBuffer = new RingBuffer(maxNumFrames, maxNumChannels);
[[Novocaine audioManager] setInputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels) {
    ringBuffer->AddNewInterleavedFloatData(data, numFrames, numChannels);
}];
[[Novocaine audioManager] setOutputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels) {
    ringBuffer->FetchInterleavedData(data, numFrames, numChannels);
}];
Hope that helps.
Here is an example:
#include <stdio.h>
#include <Accelerate/Accelerate.h>

int main(int argc, const char * argv[])
{
    // Bogus interleaved stereo data
    float stereoInput [1024];
    for (int i = 0; i < 1024; ++i)
        stereoInput[i] = (float)i;

    // Buffers to hold the deinterleaved data
    float leftSampleData [1024 / 2];
    float rightSampleData [1024 / 2];
    DSPSplitComplex output = {
        .realp = leftSampleData,
        .imagp = rightSampleData
    };

    // Split the data. The left (even) samples will end up in leftSampleData, and the right (odd) will end up in rightSampleData
    vDSP_ctoz((const DSPComplex *)stereoInput, 2, &output, 1, 1024 / 2);

    // Print the result for verification
    for (int i = 0; i < 512; ++i)
        printf("%d: %f + %f\n", i, leftSampleData[i], rightSampleData[i]);

    return 0;
}
sbooth answers how to de-interleave using vDSP_ctoz. Here's the complementary operation, namely interleaving using vDSP_ztoc.
#include <stdio.h>
#include <Accelerate/Accelerate.h>

int main(int argc, const char * argv[])
{
    const int NUM_FRAMES = 16;
    const int NUM_CHANNELS = 2;

    // Buffers for left/right channels
    float xL[NUM_FRAMES];
    float xR[NUM_FRAMES];

    // Initialize with some identifiable data
    for (int i = 0; i < NUM_FRAMES; i++)
    {
        xL[i] = 2*i;   // Even
        xR[i] = 2*i+1; // Odd
    }

    // Buffer for interleaved data
    float stereo[NUM_CHANNELS*NUM_FRAMES];
    vDSP_vclr(stereo, 1, NUM_CHANNELS*NUM_FRAMES);

    // Interleave - take separate left & right buffers, and combine into
    // single buffer alternating left/right/left/right, etc.
    DSPSplitComplex x = {xL, xR};
    vDSP_ztoc(&x, 1, (DSPComplex*)stereo, 2, NUM_FRAMES);

    // Print the result for verification. Should give output like
    //  i: L, R
    //  0: 0.00, 1.00
    //  1: 2.00, 3.00
    // etc...
    printf(" i: L, R\n");
    for (int i = 0; i < NUM_FRAMES; i++)
    {
        printf("%2d: %5.2f, %5.2f\n", i, stereo[2*i], stereo[2*i+1]);
    }
    return 0;
}

Converting int to bytes and switching endian efficiently

I have to do some int -> byte conversion and switch to big endian for some MIDI data I'm writing. Right now, I'm doing it like this:
int tempo = 500000;
char* a = (char*)&tempo;
//reverse it
inverse(a, 3);
[myMutableData appendBytes:a length:3];
and the inverse function:
void inverse(char inver_a[], int j)
{
    int i, temp;
    j--;
    for (i = 0; i < (j / 2); i++)
    {
        temp = inver_a[i];
        inver_a[i] = inver_a[j];
        inver_a[j] = temp;
        j--;
    }
}
It works, but it's not really clean, and I don't like having to specify 3 both times (since I have the luxury of knowing how many bytes it will end up being).
Is there a more convenient way I should be approaching this?
Use the Core Foundation byte swapping functions.
int32_t unswapped = 0x12345678;
int32_t swapped = CFSwapInt32HostToBig(unswapped);
char* a = (char*) &swapped;
[myMutableData appendBytes:a length:sizeof(int32_t)];
This should do the trick:
/*
Quick swap of Endian.
*/
#include <stdio.h>

int main()
{
    unsigned int number = 0x04030201;
    char *p1, *p2;
    int i;

    p1 = (char *) &number;
    p2 = (p1 + 3);
    for (i = 0; i < 2; i++)
    {
        // XOR-swap the two bytes, then move the pointers towards the middle
        *p1 ^= *p2;
        *p2 ^= *p1;
        *p1 ^= *p2;
        p1++;
        p2--;
    }
    printf("0x%08x\n", number); // prints 0x01020304
    return 0;
}
You can pack it into a function in whatever way you want to use it. The bitwise swap should compile into some pretty neat assembly :)
Hope it helps :)
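For example, packed into a reusable function it could look like the sketch below (my own version, using shifts and masks rather than the XOR trick; on GCC/Clang the __builtin_bswap32 intrinsic would do the same job):
#include <stdint.h>

// Reverse the byte order of a 32-bit value.
static uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}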
int tempo = 500000;
// reverse it in place
inverse(&tempo);
[myMutableData appendBytes:(char *)&tempo length:sizeof(tempo)];
and the inverse function:
void inverse(int *value)
{
    char *inver_a = (char *)value; // view the int's bytes in place
    int i, temp;
    int j = sizeof(*value) - 1;    // index of the last byte
    for (i = 0; i < j; i++, j--)
    {
        temp = inver_a[i];
        inver_a[i] = inver_a[j];
        inver_a[j] = temp;
    }
}