How to optimize histogram statistics with NEON intrinsics?

I want to optimize histogram statistics code with NEON intrinsics, but I haven't succeeded. Here is the C code:
#define NUM (7*1024*1024)
uint8_t src_data[NUM];
uint32_t histogram_result[256] = {0};
for (int i = 0; i < NUM; i++)
{
    histogram_result[src_data[i]]++;
}
Histogram statistics are more like serial processing, and difficult to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.

You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms).
The first thing you'll want to do, which doesn't actually require NEON to benefit, is to unroll the histogram array so that you can have more data in flight at once:
#define NUM (7*1024*1024)
uint8_t src_data[NUM];
uint32_t histogram_result[256][4] = {{0}};
uint32_t packed_result[256];
for (int i = 0; i < NUM; i += 4)
{
    uint32_t *p0 = &histogram_result[src_data[i + 0]][0];
    uint32_t *p1 = &histogram_result[src_data[i + 1]][1];
    uint32_t *p2 = &histogram_result[src_data[i + 2]][2];
    uint32_t *p3 = &histogram_result[src_data[i + 3]][3];
    uint32_t c0 = *p0;
    uint32_t c1 = *p1;
    uint32_t c2 = *p2;
    uint32_t c3 = *p3;
    *p0 = c0 + 1;
    *p1 = c1 + 1;
    *p2 = c2 + 1;
    *p3 = c3 + 1;
}
for (int i = 0; i < 256; i++)
{
    packed_result[i] = histogram_result[i][0]
                     + histogram_result[i][1]
                     + histogram_result[i][2]
                     + histogram_result[i][3];
}
Note that p0 to p3 can never point to the same address, so reordering their reads and writes is just fine.
From that you can vectorise the calculation of p0 to p3 with intrinsics, and you can vectorise the finalisation loop.
Test it as-is first (because I didn't!). Then you can experiment with structuring the array as result[4][256] instead of result[256][4], or using a smaller or larger unroll factor.
Applying some NEON intrinsics to this:
uint32_t histogram_result[256 * 4] = {0};
static const uint16_t offsets[] = { 0x000, 0x001, 0x002, 0x003,
                                    0x000, 0x001, 0x002, 0x003 };
uint16x8_t voffs = vld1q_u16(offsets);
for (int i = 0; i < NUM; i += 8) {
    uint8x8_t p = vld1_u8(&src_data[i]);
    uint16x8_t p16 = vshll_n_u8(p, 2);  /* widen to 16 bits and multiply by 4 */
    p16 = vaddq_u16(p16, voffs);
    uint32_t c0 = histogram_result[vgetq_lane_u16(p16, 0)];
    uint32_t c1 = histogram_result[vgetq_lane_u16(p16, 1)];
    uint32_t c2 = histogram_result[vgetq_lane_u16(p16, 2)];
    uint32_t c3 = histogram_result[vgetq_lane_u16(p16, 3)];
    histogram_result[vgetq_lane_u16(p16, 0)] = c0 + 1;
    c0 = histogram_result[vgetq_lane_u16(p16, 4)];
    histogram_result[vgetq_lane_u16(p16, 1)] = c1 + 1;
    c1 = histogram_result[vgetq_lane_u16(p16, 5)];
    histogram_result[vgetq_lane_u16(p16, 2)] = c2 + 1;
    c2 = histogram_result[vgetq_lane_u16(p16, 6)];
    histogram_result[vgetq_lane_u16(p16, 3)] = c3 + 1;
    c3 = histogram_result[vgetq_lane_u16(p16, 7)];
    histogram_result[vgetq_lane_u16(p16, 4)] = c0 + 1;
    histogram_result[vgetq_lane_u16(p16, 5)] = c1 + 1;
    histogram_result[vgetq_lane_u16(p16, 6)] = c2 + 1;
    histogram_result[vgetq_lane_u16(p16, 7)] = c3 + 1;
}
With the histogram array unrolled x8 rather than x4 you might want to use eight scalar accumulators instead of four, but you have to remember that that implies eight count registers and eight address registers, which is more registers than 32-bit ARM has (since you can't use SP and PC).
Unfortunately, with the address calculation in the hands of NEON intrinsics, I think the compiler can't safely reason about how it might re-order reads and writes, so you have to reorder them explicitly and hope that you're doing it the best possible way.
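The finalisation loop, by contrast, vectorises cleanly. Here's a minimal sketch (my addition, untested), assuming #include <arm_neon.h>, the flat histogram_result[256 * 4] layout from the code above, and the packed_result array from the scalar version; vld4q_u32 de-interleaves the four per-bucket counters for four buckets at a time:
for (int i = 0; i < 256; i += 4)
{
    /* v.val[k] holds the offset-k counters for buckets i..i+3 */
    uint32x4x4_t v = vld4q_u32(&histogram_result[4 * i]);
    uint32x4_t sum = vaddq_u32(vaddq_u32(v.val[0], v.val[1]),
                               vaddq_u32(v.val[2], v.val[3]));
    vst1q_u32(&packed_result[i], sum);
}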

Related

multidimensional very large array

Hi, I want to use a very large multidimensional array. I tried the following code. It compiles, but when I execute it, it gives me a segmentation fault error.
int NT = 35; int NX = 25; int NY = 25; int NZ = 25;
double dt = 0.1; double dx = 0.5; double dy = 0.5; double dz = 0.5;
double PosT[NT];
double PosX[NX]; double PosY[NY]; double PosZ[NZ];
for(int i = 0; i < NT; i++)
    PosT[i] = i*dt + dt;
for(int i = 0; i < NX; i++)
    PosX[i] = dx*i;
for(int i = 0; i < NY; i++)
    PosY[i] = dy*i;
for(int i = 0; i < NZ; i++)
    PosZ[i] = dz*i;
double* b_x = (double*)malloc(NX*NY*NZ*sizeof(double));
double* b_y = (double*)malloc(NX*NY*NZ*sizeof(double));
double** B = (double**)malloc(NX*NY*NZ*NT*sizeof(double*));
if(b_x==NULL || b_y==NULL){
    cout<<"Malloc space error!"<<endl;
    return 0;
}
for(int ix=0; ix<NX; ix++){
    for(int iy=0; iy<NY; iy++){
        for(int iz=0; iz<NZ; iz++){
            int position = ix*NY*NZ + iy*NZ + iz;
            b_x[position] = 0.;
            b_y[position] = 0.;
        }
    }
}
But the part below is where I get the segmentation fault. The next part of my code is the following lines, which involve 2D arrays. This 2D array is very large; perhaps that is why I am getting the segmentation fault:
if(B==NULL){
    cout<<"Malloc space error!"<<endl;
    return 0;
}
cout<<"work"<<endl;
for(int ix=0; ix<NX; ix++){
    for(int iy=0; iy<NY; iy++){
        for(int iz=0; iz<NZ; iz++){
            int position = ix*NY*NZ + iy*NZ + iz;
            for(int it=0; it<NT; it++){
                B[position][it] = 0.;
            }
        }
    }
}
cout<<"not working"<<endl;
So the code between "work" and "not working" has the problem that causes the segmentation fault. Any solutions for this?
int NT = 35; int NX = 25; int NY = 25; int NZ = 25;
For simplicity, let's change all of these to NT=NX=NY=NZ=2. This line:
double** B=(double**)malloc(NX*NY*NZ*NT*sizeof(double*));
would then allocate space for 16 pointers. On the first iteration through the loops, this line:
B[position][it]=0.;
would be equivalent to:
double *tmp = B[0]; // Load uninitialized pointer from B[0]
tmp[0] = 0.0; // Dereference uninitialized pointer to store something.
It shouldn't be at all surprising that this code results in a SIGSEGV.
What you probably meant:
double *B = (double *)malloc(NX*NY*NZ*NT*sizeof(double));
for(int ix = 0; ix < NX; ix++) {
    for(int iy = 0; iy < NY; iy++) {
        for(int iz = 0; iz < NZ; iz++) {
            for(int it = 0; it < NT; it++) {
                int position = NT * (NZ * (NY * ix + iy) + iz) + it;
                B[position] = 0.0;
            }
        }
    }
}
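If the flattened indexing feels error-prone, a small helper keeps the arithmetic in one place. This is just a sketch of mine (the name idx4 is hypothetical, not from the code above):
// Hypothetical helper: flat index of the logical 4-D element B[ix][iy][iz][it].
static inline int idx4(int ix, int iy, int iz, int it, int NY, int NZ, int NT)
{
    return NT * (NZ * (NY * ix + iy) + iz) + it;
}
// usage: B[idx4(ix, iy, iz, it, NY, NZ, NT)] = 0.0;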

Convolution matrix sharpen filter

I am trying to implement a sharpen convolution matrix filter for an image. For this I create a 3x3 matrix. Maybe I did something wrong with the formula? I also tried other sharpen matrices, but it didn't help. A color value could be larger than 255 or smaller than zero, so I decided to clamp it to the range (0, 255). Is that a correct solution?
static const int filterSmallMatrixSize = 3;
static const int sharpMatrix[3][3] = {{-1, -1, -1},{-1, 9, -1},{-1, -1, -1}};
Some defines:
#define Mask8(x) ( (x) & 0xFF )
#define R(x) ( Mask8(x) )
#define G(x) ( Mask8(x >> 8 ) )
#define B(x) ( Mask8(x >> 16) )
#define A(x) ( Mask8(x >> 24) )
#define RGBAMake(r, g, b, a) ( Mask8(r) | Mask8(g) << 8 | Mask8(b) << 16 | Mask8(a) << 24 )
And the algorithm:
- (UIImage *)processSharpFilterUsingPixels:(UIImage *)inputImage
{
    UInt32 *inputPixels;
    CGImageRef inputCGImage = [inputImage CGImage];
    NSUInteger inputWidth = CGImageGetWidth(inputCGImage);
    NSUInteger inputHeight = CGImageGetHeight(inputCGImage);
    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    NSUInteger bytesPerPixel = 4;
    NSUInteger bitsPerComponent = 8;
    NSUInteger inputBytesPerRow = bytesPerPixel * inputWidth;
    inputPixels = (UInt32 *)calloc(inputHeight * inputWidth, sizeof(UInt32));
    CGContextRef context = CGBitmapContextCreate(inputPixels, inputWidth, inputHeight,
                                                 bitsPerComponent, inputBytesPerRow, colorSpace,
                                                 kCGImageAlphaPremultipliedLast | kCGBitmapByteOrder32Big);
    CGContextDrawImage(context, CGRectMake(0, 0, inputWidth, inputHeight), inputCGImage);
    for (NSUInteger j = 1; j < inputHeight - 1; j++)
    {
        for (NSUInteger i = 1; i < inputWidth - 1; i++)
        {
            Float32 newRedColor = 0;
            Float32 newGreenColor = 0;
            Float32 newBlueColor = 0;
            Float32 newA = 0;
            for (int filterMatrixI = 0; filterMatrixI < filterSmallMatrixSize; filterMatrixI++)
            {
                for (int filterMatrixJ = 0; filterMatrixJ < filterSmallMatrixSize; filterMatrixJ++)
                {
                    UInt32 *currentPixel = inputPixels + ((j + filterMatrixJ - 1) * inputWidth) + i + filterMatrixI - 1;
                    int color = *currentPixel;
                    newRedColor += (R(color) * sharpMatrix[filterMatrixI][filterMatrixJ]);
                    newGreenColor += (G(color) * sharpMatrix[filterMatrixI][filterMatrixJ]);
                    newBlueColor += (B(color) * sharpMatrix[filterMatrixI][filterMatrixJ]);
                    newA += (A(color) * sharpMatrix[filterMatrixI][filterMatrixJ]);
                }
            }
            int r = MAX( MIN((int)newRedColor, 255), 0);
            int g = MAX( MIN((int)newGreenColor, 255), 0);
            int b = MAX( MIN((int)newBlueColor, 255), 0);
            int a = MAX( MIN((int)newA, 255), 0);
            UInt32 *currentMainImagePixel = inputPixels + (j * inputWidth) + i;
            *currentMainImagePixel = RGBAMake(r, g, b, a);
        }
    }
    CGImageRef newCGImage = CGBitmapContextCreateImage(context);
    UIImage *processedImage = [UIImage imageWithCGImage:newCGImage];
    CGColorSpaceRelease(colorSpace);
    CGContextRelease(context);
    free(inputPixels);
    return processedImage;
}
As a result I have this.
Consider that these are pixels in the middle of the image:
|_|_|_|_|
|_|_|_|_|
|_|_|_|_|
|_|_|_|_|
Since you are updating the image in place, this is how it looks somewhere in the middle of the sharpen pass:
|u|u|u|u|
|u|u|u|u|
|u|c|_|_|
|_|_|_|_|
Here u stands for an updated pixel and c for the current one. Its new color depends on the colors of the surrounding pixels, half of which are from the already-sharpened image and half from the original. To fix this we need a copy of the original image's pixels:
...
CGContextDrawImage(context, CGRectMake(0, 0, inputWidth, inputHeight), inputCGImage);
UInt32 *origPixels = calloc(inputHeight * inputWidth, sizeof(UInt32));
memcpy(origPixels, inputPixels, inputHeight * inputWidth * sizeof(UInt32));
for (NSUInteger j = 1; j < inputHeight - 1; j++) {
...
And now we only need to change one line to read the current pixels from the original image:
//changed inputPixels -> origPixels
UInt32 * currentPixel = origPixels + ((j + filterMatrixJ - 1) * inputWidth) + i + filterMatrixI - 1;
Here are some examples of how it works compared to the non-updated filter (the link is Dropbox, sorry about that). I've tried different matrices, and for me the best was somewhere around
const float sharpMatrix[3][3] = {{-0.3, -0.3, -0.3},{-0.3, 3.4, -0.3},{-0.3, -0.3, -0.3}};
Also, I should note that this way of keeping the original image is not optimal. My fix basically doubles the amount of memory consumed. It could be done by holding only two rows of pixels, and I'm sure there are even better ways.
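For illustration, here is a rough sketch of that two-row idea (my addition, written in plain C rather than the answer's Objective-C, and untested; the function name and parameters are hypothetical):
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Filter 'pixels' (width x height, RGBA) in place, keeping copies of only
   the two original rows that the in-place writes have already clobbered. */
static void sharpenInPlaceTwoRows(uint32_t *pixels, size_t width, size_t height)
{
    if (height < 3)
        return;
    uint32_t *prevRows = malloc(2 * width * sizeof(uint32_t));
    if (prevRows == NULL)
        return;
    memcpy(prevRows, pixels, 2 * width * sizeof(uint32_t)); /* rows 0 and 1 */
    for (size_t j = 1; j + 1 < height; j++) {
        const uint32_t *rowAbove = &prevRows[((j - 1) % 2) * width]; /* original row j-1 */
        const uint32_t *rowSelf  = &prevRows[(j % 2) * width];       /* original row j   */
        const uint32_t *rowBelow = &pixels[(j + 1) * width];         /* not written yet  */
        (void)rowAbove; (void)rowSelf;
        /* ... apply the 3x3 kernel to rowAbove/rowSelf/rowBelow here,
           writing each clamped result into pixels[j * width + i] ... */
        /* Stash original row j+1 in the slot that row j-1 no longer needs: */
        memcpy(&prevRows[((j - 1) % 2) * width], rowBelow, width * sizeof(uint32_t));
    }
    free(prevRows);
}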

iOS: CRC in Objective-C

I am new to iOS. I need to create a data packet using a CRC algorithm for the commands below:
int comm[6];
comm[0]=0x01;
comm[1]=6;
comm[2]=0x70;
comm[3]=0x00;
comm[4]=0xFFFF;
comm[5]=0xFFFF;
I have Java code that does the same thing, developed for Android:
byte[] getCRC(byte[] bytes)
{
    byte[] result = new byte[2];
    try
    {
        short crc = (short) 0xFFFF;
        for (int j = 0; j < bytes.length; j++)
        {
            byte c = bytes[j];
            for (int i = 7; i >= 0; i--)
            {
                boolean c15 = ((crc >> 15 & 1) == 1);
                boolean bit = ((c >> (7 - i) & 1) == 1);
                crc <<= 1;
                if (c15 ^ bit)
                {
                    crc ^= 0x1021; // 0001 0000 0010 0001 (0, 5, 12)
                }
            }
        }
        int crc2 = crc - 0xffff0000;
        result[0] = (byte) (crc2 % 256);
        result[1] = (byte) (crc2 / 256);
        return result;
    }
    catch (Exception ex)
    {
        result = null;
        return result;
    }
}
Input for the getCRC() method: the data packet for which the CRC is to be calculated.
Output of the getCRC() method: the CRC for the packet.
I need to do the same thing in Objective-C. Please help; any sample code would also be appreciated.
Objective-C also incorporates C, so the contents of your method will look almost the same as in Java. All that is needed is to pass your data into and out of the method, in this example using NSData:
- (NSData *)bytesCRCResult:(NSData *)dataBytes
{
    unsigned char *result = (unsigned char *)malloc(2);
    unsigned char *bytes = (unsigned char *)[dataBytes bytes]; // returns readonly pointer to the byte stream
    uint16_t crc = (short) 0xFFFF;
    for (int j = 0; j < dataBytes.length; j++)
    {
        unsigned char c = bytes[j];
        for (int i = 7; i >= 0; i--)
        {
            bool c15 = ((crc >> 15 & 1) == 1);
            bool bit = ((c >> (7 - i) & 1) == 1);
            crc <<= 1;
            if (c15 ^ bit)
            {
                crc ^= 0x1021; // 0001 0000 0010 0001 (0, 5, 12)
            }
        }
    }
    uint16_t crc2 = crc - 0xffff0000;
    result[0] = (unsigned char) (crc2 % 256);
    result[1] = (unsigned char) (crc2 / 256);
    NSData *resultsToData = [NSData dataWithBytes:result length:2];
    free(result);
    return resultsToData;
}
NSData can be read as raw bytes using the [NSData bytes] method call, and has a range of useful properties and methods.
For the boolean value, you have a few options:
"bool" seems to be the ISO C/C++ standard type
"Boolean" is defined as "typedef unsigned char"
"boolean_t" is defined as "typedef unsigned int" or "typedef int", depending on 64-bit compilation apparently
"BOOL", the Objective-C bool, which is defined as "typedef signed char", according to http://nshipster.com/bool/ and might therefore not behave as expected.
"uint8_t" can be substituted for "unsigned char", for clarity.
Please note: The above code compiles without warning or complaint, but wasn't tested with actual data.
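Since Objective-C incorporates C, the same routine can also live in a plain C function, which makes it easy to unit-test outside the app. Here's a minimal sketch (my addition, matching the bit-by-bit CRC-16/CCITT above with polynomial 0x1021 and initial value 0xFFFF; the function name crc16_ccitt is my own):
#include <stdint.h>
#include <stddef.h>

static void crc16_ccitt(const uint8_t *bytes, size_t len, uint8_t out[2])
{
    uint16_t crc = 0xFFFF;
    for (size_t j = 0; j < len; j++) {
        uint8_t c = bytes[j];
        for (int i = 7; i >= 0; i--) {
            int c15 = (crc >> 15) & 1;      /* top bit before the shift */
            int bit = (c >> (7 - i)) & 1;   /* next message bit, MSB first */
            crc = (uint16_t)(crc << 1);
            if (c15 ^ bit)
                crc ^= 0x1021;
        }
    }
    out[0] = (uint8_t)(crc & 0xFF); /* low byte first, like crc2 % 256 above */
    out[1] = (uint8_t)(crc >> 8);   /* high byte, like crc2 / 256 above */
}

/* usage: uint8_t crc[2]; crc16_ccitt(packet, packetLength, crc); */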

The subscripted value is neither an array nor a pointer

So I am trying to allocate memory for a 2D array of ints such that I can reference it outside of the loop in which the size is determined. (I have a scope issue because the size of the array isn't fixed.)
So this was the proposed solution, but I am getting the error "subscripted value is neither an array nor a pointer". Does anyone know what I am doing wrong?
//M and m are just 2 int numbers
int X = self.create2dArray(M,m);
for(int kk = 0; kk < M; kk++)
{
    for (int kk1 = 0; kk1 < m; kk1++)
    {
        //small "x" is an NSMutableArray of NSNumbers, so I am just running the 2 for loops to fill the whole 2D array
        X[kk][kk1] = [[x objectAtIndex: (kk + kk1 * J)] intValue]; //ERROR line
    }
}
//outside of main
static inline int **create2dArray(int w, int h)
{
    size_t size = sizeof(int) * 2 + w * sizeof(int *);
    int **arr = malloc(size);
    int *sizes = (int *) arr;
    sizes[0] = w;
    sizes[1] = h;
    arr = (int **) (sizes + 2);
    for (int i = 0; i < w; i++)
    {
        arr[i] = calloc(h, sizeof(**arr));
    }
    return arr;
}
I believe that first line should declare X as int **X instead of int X.
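For illustration, a simpler alternative is to skip the pointer table entirely and allocate one contiguous block, indexing it as X[row * m + col]. This is my own sketch, not the proposed solution above, and it assumes M and m as in the question:
#include <stdlib.h>

int *X = calloc((size_t)M * (size_t)m, sizeof *X);
if (X != NULL) {
    for (int kk = 0; kk < M; kk++)
        for (int kk1 = 0; kk1 < m; kk1++)
            X[kk * m + kk1] = 0; /* or the value pulled from the NSMutableArray */
}
/* ... use X, then free(X); */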
Okay, so your problem most likely lies in how you are manually allocating the memory for the integers. My proposed solution is to just fill the array with "filler" values, for instance 0, when you create it. By doing so, you don't risk messing up the allocation process. In addition, it's much easier, and it works given that you'll be filling it up with real integers later on. Hope this helped!

Reading PVRTC image color information for each pixel

How do I read the image color information for each pixel of a PVRTC image?
Here is my code for extracting the integer arrays:
NSData *data = [[NSData alloc] initWithContentsOfFile:path];
NSMutableArray *_imageData = [[NSMutableArray alloc] initWithCapacity:10];
BOOL success = FALSE;
PVRTexHeader *header = NULL;
uint32_t flags, pvrTag;
uint32_t dataLength = 0, dataOffset = 0, dataSize = 0;
uint32_t blockSize = 0, widthBlocks = 0, heightBlocks = 0;
uint32_t width = 0, height = 0, bpp = 4;
uint8_t *bytes = NULL;
uint32_t formatFlags;
header = (PVRTexHeader *)[data bytes];
pvrTag = CFSwapInt32LittleToHost(header->pvrTag);
if (gPVRTexIdentifier[0] != ((pvrTag >>  0) & 0xff) ||
    gPVRTexIdentifier[1] != ((pvrTag >>  8) & 0xff) ||
    gPVRTexIdentifier[2] != ((pvrTag >> 16) & 0xff) ||
    gPVRTexIdentifier[3] != ((pvrTag >> 24) & 0xff))
{
    return FALSE;
}
flags = CFSwapInt32LittleToHost(header->flags);
formatFlags = flags & PVR_TEXTURE_FLAG_TYPE_MASK;
if (formatFlags == kPVRTextureFlagTypePVRTC_4 || formatFlags == kPVRTextureFlagTypePVRTC_2)
{
    [_imageData removeAllObjects];
    if (formatFlags == kPVRTextureFlagTypePVRTC_4)
        _internalFormat = GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG;
    else if (formatFlags == kPVRTextureFlagTypePVRTC_2)
        _internalFormat = GL_COMPRESSED_RGBA_PVRTC_2BPPV1_IMG;
    _width = width = CFSwapInt32LittleToHost(header->width);
    _height = height = CFSwapInt32LittleToHost(header->height);
    if (CFSwapInt32LittleToHost(header->bitmaskAlpha))
        _hasAlpha = TRUE;
    else
        _hasAlpha = FALSE;
    dataLength = CFSwapInt32LittleToHost(header->dataLength);
    bytes = ((uint8_t *)[data bytes]) + sizeof(PVRTexHeader);
    // Calculate the data size for each texture level and respect the minimum number of blocks
    while (dataOffset < dataLength)
    {
        if (formatFlags == kPVRTextureFlagTypePVRTC_4)
        {
            blockSize = 4 * 4; // Pixel by pixel block size for 4bpp
            widthBlocks = width / 4;
            heightBlocks = height / 4;
            bpp = 4;
        }
        else
        {
            blockSize = 8 * 4; // Pixel by pixel block size for 2bpp
            widthBlocks = width / 8;
            heightBlocks = height / 4;
            bpp = 2;
        }
        // Clamp to minimum number of blocks
        if (widthBlocks < 2)
            widthBlocks = 2;
        if (heightBlocks < 2)
            heightBlocks = 2;
        dataSize = widthBlocks * heightBlocks * ((blockSize * bpp) / 8);
        [_imageData addObject:[NSData dataWithBytes:bytes+dataOffset length:dataSize]];
        // Advance to the next mipmap level
        dataOffset += dataSize;
        width = MAX(width >> 1, 1);
        height = MAX(height >> 1, 1);
    }
    for (int i = 0; i < mipmapCount; i++)
    {
        NSLog(@"width:%d, height:%d", width, height);
        data = [[NSData alloc] initWithData:[_imageData objectAtIndex:i]];
        NSLog(@"data length:%d", (int)[data length]);
        // extracted 20 sample values, but all you can see are large integer numbers
        for (int k = 0; k < 20; k++) {
            NSLog(@"data[%d]:%d", k, ((const uint8_t *)[data bytes])[k]);
        }
    }
}
PVRTC is a 4x4 (or 8x4) texel, block-based compression system that takes into account surrounding blocks to represent two low frequency images with which higher frequency modulation data is combined in order to produce the actual texel output. A better explanation is available here:
http://web.onetel.net.uk/~simonnihal/assorted3d/fenney03texcomp.pdf
So the values you're extracting are actually parts of the encoded blocks and these need to be decoded correctly in order to get sensible values.
There are two ways to get to the colour information: decode/decompress the PVR texture information using a software decompressor or render the texture using a POWERVR graphics core and then read the result back. I'll only discuss the first option here.
It's rather tricky to assemble a decompressor from only the information there, but fortunately there's C++ decompression source code in the POWERVR SDK which you can get here - download one of the iPhone SDKs for instance:
http://www.imgtec.com/powervr/insider/powervr-sdk.asp
It's in the Tools/PVRTDecompress.cpp file.
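For reference, a call into that decompressor looks roughly like this (my sketch, untested; the exact prototype varies between SDK versions, so check PVRTDecompress.h in your copy before relying on it):
/* Assumed prototype, to be verified against your SDK's PVRTDecompress.h:
   int PVRTDecompressPVRTC(const void *pCompressedData, int Do2bitMode,
                           int XDim, int YDim, unsigned char *pResultImage); */
unsigned char *rgba = (unsigned char *)malloc(width * height * 4);
PVRTDecompressPVRTC(bytes + dataOffset,
                    (formatFlags == kPVRTextureFlagTypePVRTC_2) ? 1 : 0,
                    width, height, rgba);
/* rgba now holds 8-bit RGBA; pixel (x, y) starts at rgba[(y * width + x) * 4] */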
Hope that helps.