Efficient algorithm to convert(sum) 128-bit data in q-register to 16-bit data - embedded

I have 128-bit data in q-register. I want to sum the individual 16-bit block in this q-register to finally have a 16-bit final sum (any carry beyond 16-bit should be taken and added to the LSB of this 16-bit num).
what I want to achieve is:
VADD.U16 (some 16-bit variable) {q0[0] q0[1] q0[2] ......... q0[7]}
but using intrinsics,
would appreciate if someone could give me an algorithm for this.
I tried using pair-wise addition, but I'm ending up with rather a clumsy solution..
Heres how it looks:
int convert128to16(uint16x8_t data128){
uint16_t data16 = 0;
uint16x4_t ddata;
print16(data128);
uint32x4_t data = vpaddlq_u16(data128);
print32(data);
uint16x4_t data_hi = vget_high_u16(data);
print16x4(data_hi);
uint16x4_t data_low = vget_low_u16(data);
print16x4(data_low);
ddata = vpadd_u16( data_hi, data_low);
print16x4(ddata);
}
It's still incomplete and a bit clumsy.. Any help would be much appreciated.

You can use the horizontal add instructions:
Here is a fragment:
uint16x8_t input = /* load your data128 here */
uint64x2_t temp = vpaddlq_u32 (vpaddlq_u16 (input));
uint64x1_t result = vadd_u64 (vget_high_u64 (temp),
vget_low_u64 (temp));
// result now contains the sum of all 16 bit unsigned words
// stored in data128.
// to add the values that overflow from 16 bit just do another 16 bit
// horizontal addition and return the lowest 16 bit as the final result:
uint16x4_t w = vpadd_u16 (
vreinterpret_u16_u64 (result),
vreinterpret_u16_u64 (result));
uint16_t wrappedResult = vget_lane_u16 (w, 0);

I f your goal is to sum the 16 bit chunks (modulo 16 bit), the following fragment would do:
uin16_t convert128to16(uint16x8_t data128){
data128 += (data128 >> 64);
data128 += (data128 >> 32);
data128 += (data128 >> 16);
return data128 & 0xffff;
}

Related

Gather AVX2&512 intrinsic for 16-bit integers?

Imagine this piece of code:
void Function(int16 *src, int *indices, float *dst, int cnt, float mul)
{
for (int i=0; i<cnt; i++) dst[i] = float(src[indices[i]]) * mul;
};
This really asks for gather intrinsics e.g. _mm_i32gather_epi32. I got great success with these when loading floats, but are there any for 16-bit ints? Another problem here is that I need to transition from 16-bits on the input to 32-bits (float) on the output.
There is indeed no instruction to gather 16bit integers, but (assuming there is no risk of memory-access violation) you can just load 32bit integers starting at the corresponding addresses, and mask out the upper halves of each value.
For uint16_t this would be a simple bit-and, for signed integers you can shift the values to the left in order to have the sign bit at the most-significant position. You can then (arithmetically) shift back the values before converting them to float, or, since you multiply them anyway, just scale the multiplication factor accordingly.
Alternatively, you could load from two bytes earlier and arithmetically shift to the right. Either way, your bottle-neck will likely be the load-ports (vpgatherdd requires 8 load-uops. Together with the load for the indices you have 9 loads distributed on two ports, which should result in 4.5 cycles for 8 elements).
Untested possible AVX2 implementation (does not handle the last elements, if cnt is not a multiple of 8 just execute your original loop at the end):
void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
__m256 mul = _mm256_set1_ps(mul_*float(1.0f/0x10000));
for (size_t i=0; i+8<=cnt; i+=8){ // todo handle last elements
// load indicies:
__m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
// load 16bit integers in the lower halves + garbage in the upper halves:
__m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
// shift each value to upper half (removes garbage, makes sure sign is at the right place)
// values are too large by a factor of 0x10000
values = _mm256_slli_epi32(values, 16);
// convert to float, scale and multiply:
__m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
// store result
_mm256_storeu_ps(dst, fvalues);
}
}
Porting this to AVX-512 should be straight-forward.

Objective c, Scanf() string taking in the same value twice

Hi all I am having a strange issue, when i use scanf to input data it repeats strings and saves them as one i am not sure why.
Please Help
/* Assment Label loop - Loops through the assment labels and inputs the percentage and the name for it. */
i = 0;
j = 0;
while (i < totalGradedItems)
{
scanf("%s%d", assLabel[i], &assPercent[i]);
i++;
}
/* Print Statement */
i = 0;
while (i < totalGradedItems)
{
printf("%s", assLabel[i]);
i++;
}
Input Data
Prog1 20
Quiz 20
Prog2 20
Mdtm 15
Final 25
Output Via Console
Prog1QuizQuizProg2MdtmMdtmFinal
Final diagnosis
You don't show your declarations...but you must be allocating just 5 characters for the strings:
When I adjust the enum MAX_ASSESSMENTLEN from 10 to 5 (see the code below) I get the output:
Prog1Quiz 20
Quiz 20
Prog2Mdtm 20
Mdtm 15
Final 25
You did not allow for the terminal null. And you didn't show us what was causing the bug! And the fact that you omitted newlines from the printout obscured the problem.
What's happening is that 'Prog1' is occupying all 5 bytes of the string you read in, and is writing a null at the 6th byte; then Quiz is being read in, starting at the sixth byte.
When printf() goes to read the string for 'Prog1', it stops at the first null, which is the one after the 'z' of 'Quiz', producing the output shown. Repeat for 'Prog2' and 'Mtdm'. If there was an entry after 'Final', it too would suffer. You are lucky that there are enough zero bytes around to prevent any monstrous overruns.
This is a basic buffer overflow (indeed, since the array is on the stack, it is a basic Stack Overflow); you are trying to squeeze 6 characters (Prog1 plus '\0') into a 5 byte space, and it simply does not work well.
Preliminary diagnosis
First, print newlines after your data.
Second, check that scanf() is not returning errors - it probably isn't, but neither you nor we can tell for sure.
Third, are you sure that the data file contains what you say? Plausibly, it contains a pair of 'Quiz' and a pair of 'Mtdm' lines.
Your variable j is unused, incidentally.
You would probably be better off having the input loop run until you are either out of space in the receiving arrays or you get a read failure. However, the code worked for me when dressed up slightly:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
char assLabel[10][10];
int assPercent[10];
int i = 0;
int totalGradedItems = 5;
while (i < totalGradedItems)
{
if (scanf("%9s%d", assLabel[i], &assPercent[i]) != 2)
{
fprintf(stderr, "Error reading\n");
exit(1);
}
i++;
}
/* Print Statement */
i = 0;
while (i < totalGradedItems)
{
printf("%-9s %3d\n", assLabel[i], assPercent[i]);
i++;
}
return 0;
}
For the quoted input data, the output results are:
Prog1 20
Quiz 20
Prog2 20
Mdtm 15
Final 25
I prefer this version, though:
#include <stdio.h>
enum { MAX_GRADES = 10 };
enum { MAX_ASSESSMENTLEN = 10 };
int main(void)
{
char assLabel[MAX_GRADES][MAX_ASSESSMENTLEN];
int assPercent[MAX_GRADES];
int i = 0;
int totalGradedItems;
for (i = 0; i < MAX_GRADES; i++)
{
if (scanf("%9s%d", assLabel[i], &assPercent[i]) != 2)
break;
}
totalGradedItems = i;
for (i = 0; i < totalGradedItems; i++)
printf("%-9s %3d\n", assLabel[i], assPercent[i]);
return 0;
}
Of course, if I'd set up the scanf() format string 'properly' (meaning safely) so as to limit the length of the assessment names to fit into the space allocated, then the loop would stop reading on the second attempt:
...
char format[10];
...
snprintf(format, sizeof(format), "%%%ds%%d", MAX_ASSESSMENTLEN-1);
...
if (scanf(format, assLabel[i], &assPercent[i]) != 2)
With MAX_ASSESSMENTLEN at 5, the snprintf() generates the format string "%4s%d". The code compiled reads:
Prog 1
and stops. The '1' comes from the 5th character of 'Prog1'; the next assessment name is '20', and then the conversion of 'Quiz' into a number fails, causing the input loop to stop (because only one of two expected items was converted).
Despite the nuisance value, if you want to make your scanf() strings adjust to the size of the data variables it is reading into, you have to do something akin to what I did here - format the string using the correct size values.
i guess, you need to put a
scanf("%s%d", assLabel[i], &assPercent[i]);
space between %s and %d here.
And it is not saving as one. You need to put newline or atlease a space after %s on print to see difference.
add:
when i tried
#include <stdio.h>
int main (int argc, const char * argv[])
{
char a[1][2];
for(int i =0;i<3;i++)
scanf("%s",a[i]);
for(int i =0;i<3;i++)
printf("%s",a[i]);
return 0;
}
with inputs
123456
qwerty
sdfgh
output is:
12qwsdfghqwsdfghsdfgh
that proves that, the size of string array need to be bigger then decleared there.

Question regarding ip checksum code

unsigned short /* this function generates header checksums */
csum (unsigned short *buf, int nwords)
{
unsigned long sum;
for (sum = 0; nwords > 0; nwords--) // add words(16bits) together
{
sum += *buf++;
}
sum = (sum >> 16) + (sum & 0xffff); //add carry over
sum += (sum >> 16); //MY question: what exactly does this step do??? add possible left-over
//byte? But hasn't it already been added in the loop (if
//any)?
return ((unsigned short) ~sum);
}
I assume nwords in the number of 16bits word, not 8bits byte (if there are odd byte, nword is rounded to next large), is it correct? Say ip_hdr has 27 bytes totally, then nword will be 14 instead of 13, right?
The line sum = (sum >> 16) + (sum & 0xffff) is to add carry over to make 16bit complement
sum += (sum >> 16); What's the purpose of this step? Add left-over byte? But left-over byte has already been added in the loop?
Thanks!
You're correct. Step 3 condenses sum, a 32-bit long, into a 16-bit unsigned short, which is the length of the checksum. This is for performance purposes, allowing one to calculate the checksum without tracking overflow until the end. It does this both in step 2 and step 3 because it may have overflowed from step 2. Then it returns just the inverted lower 16 bits of sum.
This is a bit clearer:
http://www.sysnet.ucsd.edu/~cfleizac/iptcphdr.html

Non repeating random numbers in Objective-C

I'm using
for (int i = 1, i<100, i++)
int i = arc4random() % array count;
but I'm getting repeats every time. How can I fill out the chosen int value from the range, so that when the program loops I will not get any dupe?
It sounds like you want shuffling of a set rather than "true" randomness. Simply create an array where all the positions match the numbers and initialize a counter:
num[ 0] = 0
num[ 1] = 1
: :
num[99] = 99
numNums = 100
Then, whenever you want a random number, use the following method:
idx = rnd (numNums); // return value 0 through numNums-1
val = num[idx]; // get then number at that position.
num[idx] = val[numNums-1]; // remove it from pool by overwriting with highest
numNums--; // and removing the highest position from pool.
return val; // give it back to caller.
This will return a random value from an ever-decreasing pool, guaranteeing no repeats. You will have to beware of the pool running down to zero size of course, and intelligently re-initialize the pool.
This is a more deterministic solution than keeping a list of used numbers and continuing to loop until you find one not in that list. The performance of that sort of algorithm will degrade as the pool gets smaller.
A C function using static values something like this should do the trick. Call it with
int i = myRandom (200);
to set the pool up (with any number zero or greater specifying the size) or
int i = myRandom (-1);
to get the next number from the pool (any negative number will suffice). If the function can't allocate enough memory, it will return -2. If there's no numbers left in the pool, it will return -1 (at which point you could re-initialize the pool if you wish). Here's the function with a unit testing main for you to try out:
#include <stdio.h>
#include <stdlib.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size) {
int i, n;
static int numNums = 0;
static int *numArr = NULL;
// Initialize with a specific size.
if (size >= 0) {
if (numArr != NULL)
free (numArr);
if ((numArr = malloc (sizeof(int) * size)) == NULL)
return ERR_NO_MEM;
for (i = 0; i < size; i++)
numArr[i] = i;
numNums = size;
}
// Error if no numbers left in pool.
if (numNums == 0)
return ERR_NO_NUM;
// Get random number from pool and remove it (rnd in this
// case returns a number between 0 and numNums-1 inclusive).
n = rand() % numNums;
i = numArr[n];
numArr[n] = numArr[numNums-1];
numNums--;
if (numNums == 0) {
free (numArr);
numArr = 0;
}
return i;
}
int main (void) {
int i;
srand (time (NULL));
i = myRandom (20);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1);
}
printf ("Final = %3d\n", i);
return 0;
}
And here's the output from one run:
Number = 19
Number = 10
Number = 2
Number = 15
Number = 0
Number = 6
Number = 1
Number = 3
Number = 17
Number = 14
Number = 12
Number = 18
Number = 4
Number = 9
Number = 7
Number = 8
Number = 16
Number = 5
Number = 11
Number = 13
Final = -1
Keep in mind that, because it uses statics, it's not safe for calling from two different places if they want to maintain their own separate pools. If that were the case, the statics would be replaced with a buffer (holding count and pool) that would "belong" to the caller (a double-pointer could be passed in for this purpose).
And, if you're looking for the "multiple pool" version, I include it here for completeness.
#include <stdio.h>
#include <stdlib.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size, int *ppPool[]) {
int i, n;
// Initialize with a specific size.
if (size >= 0) {
if (*ppPool != NULL)
free (*ppPool);
if ((*ppPool = malloc (sizeof(int) * (size + 1))) == NULL)
return ERR_NO_MEM;
(*ppPool)[0] = size;
for (i = 0; i < size; i++) {
(*ppPool)[i+1] = i;
}
}
// Error if no numbers left in pool.
if (*ppPool == NULL)
return ERR_NO_NUM;
// Get random number from pool and remove it (rnd in this
// case returns a number between 0 and numNums-1 inclusive).
n = rand() % (*ppPool)[0];
i = (*ppPool)[n+1];
(*ppPool)[n+1] = (*ppPool)[(*ppPool)[0]];
(*ppPool)[0]--;
if ((*ppPool)[0] == 0) {
free (*ppPool);
*ppPool = NULL;
}
return i;
}
int main (void) {
int i;
int *pPool;
srand (time (NULL));
pPool = NULL;
i = myRandom (20, &pPool);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1, &pPool);
}
printf ("Final = %3d\n", i);
return 0;
}
As you can see from the modified main(), you need to first initialise an int pointer to NULL then pass its address to the myRandom() function. This allows each client (location in the code) to have their own pool which is automatically allocated and freed, although you could still share pools if you wish.
You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.
Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 2, width 16), with an algorithm which is still cryptographically robust.
It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.
AES-FFX is one proposed standard method to achieve this. I've experimented with some basic Python code which is based on the AES-FFX idea, although not fully conformant--see Python code here. It can e.g. encrypt a counter to a random-looking 7-digit decimal number, or a 16-bit number.
You need to keep track of the numbers you have already used (for instance, in an array). Get a random number, and discard it if it has already been used.
Without relying on external stochastic processes, like radioactive decay or user input, computers will always generate pseudorandom numbers - that is numbers which have many of the statistical properties of random numbers, but repeat in sequences.
This explains the suggestions to randomise the computer's output by shuffling.
Discarding previously used numbers may lengthen the sequence artificially, but at a cost to the statistics which give the impression of randomness.
The best way to do this is create an array for numbers already used. After a random number has been created then add it to the array. Then when you go to create another random number, ensure that it is not in the array of used numbers.
In addition to using secondary array to store already generated random numbers, invoking random no. seeding function before every call of random no. generation function might help to generate different seq. of random numbers in every run.

Functions to compress and uncompress array of integers

I was recently asked to complete a task for a c++ role, however as the application was decided not to be progressed any further I thought that I would post here for some feedback / advice / improvements / reminder of concepts I've forgotten.
The task was:
The following data is a time series of integer values
int timeseries[32] = {67497, 67376, 67173, 67235, 67057, 67031, 66951,
66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044,
67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620,
66579, 66596, 66713, 66852, 66715};
The series might be, for example, the closing price of a stock each day
over a 32 day period.
As stored above, the data will occupy 32 x sizeof(int) bytes = 128 bytes
assuming 4 byte ints.
Using delta encoding , write a function to compress, and a function to
uncompress data like the above.
Ok, so before this point I had never looked at compression so my solution is far from perfect. The manner in which I approached the problem is by compressing the array of integers into a array of bytes. When representing the integer as a byte I keep the calculate most
significant byte (msb) and keep everything up to this point, whilst throwing the rest away. This is then added to the byte array. For negative values I increment the msb by 1 so that we can
differentiate between positive and negative bytes when decoding by keeping the leading
1 bit values.
When decoding I parse this jagged byte array and simply reverse my
previous actions performed when compressing. As mentioned I have never looked at compression prior to this task so I did come up with my own method to compress the data. I was looking at C++/Cli recently, had not really used it previously so just decided to write it in this language, no particular reason. Below is the class, and a unit test at the very bottom. Any advice / improvements / enhancements will be much appreciated.
Thanks.
array<array<Byte>^>^ CDeltaEncoding::CompressArray(array<int>^ data)
{
int temp = 0;
int original;
int size = 0;
array<int>^ tempData = gcnew array<int>(data->Length);
data->CopyTo(tempData, 0);
array<array<Byte>^>^ byteArray = gcnew array<array<Byte>^>(tempData->Length);
for (int i = 0; i < tempData->Length; ++i)
{
original = tempData[i];
tempData[i] -= temp;
temp = original;
int msb = GetMostSignificantByte(tempData[i]);
byteArray[i] = gcnew array<Byte>(msb);
System::Buffer::BlockCopy(BitConverter::GetBytes(tempData[i]), 0, byteArray[i], 0, msb );
size += byteArray[i]->Length;
}
return byteArray;
}
array<int>^ CDeltaEncoding::DecompressArray(array<array<Byte>^>^ buffer)
{
System::Collections::Generic::List<int>^ decodedArray = gcnew System::Collections::Generic::List<int>();
int temp = 0;
for (int i = 0; i < buffer->Length; ++i)
{
int retrievedVal = GetValueAsInteger(buffer[i]);
decodedArray->Add(retrievedVal);
decodedArray[i] += temp;
temp = decodedArray[i];
}
return decodedArray->ToArray();
}
int CDeltaEncoding::GetMostSignificantByte(int value)
{
array<Byte>^ tempBuf = BitConverter::GetBytes(Math::Abs(value));
int msb = tempBuf->Length;
for (int i = tempBuf->Length -1; i >= 0; --i)
{
if (tempBuf[i] != 0)
{
msb = i + 1;
break;
}
}
if (!IsPositiveInteger(value))
{
//We need an extra byte to differentiate the negative integers
msb++;
}
return msb;
}
bool CDeltaEncoding::IsPositiveInteger(int value)
{
return value / Math::Abs(value) == 1;
}
int CDeltaEncoding::GetValueAsInteger(array<Byte>^ buffer)
{
array<Byte>^ tempBuf;
if(buffer->Length % 2 == 0)
{
//With even integers there is no need to allocate a new byte array
tempBuf = buffer;
}
else
{
tempBuf = gcnew array<Byte>(4);
System::Buffer::BlockCopy(buffer, 0, tempBuf, 0, buffer->Length );
unsigned int val = buffer[buffer->Length-1] &= 0xFF;
if ( val == 0xFF )
{
//We have negative integer compressed into 3 bytes
//Copy over the this last byte as well so we keep the negative pattern
System::Buffer::BlockCopy(buffer, buffer->Length-1, tempBuf, buffer->Length, 1 );
}
}
switch(tempBuf->Length)
{
case sizeof(short):
return BitConverter::ToInt16(tempBuf,0);
case sizeof(int):
default:
return BitConverter::ToInt32(tempBuf,0);
}
}
And then in a test class I had:
void CTestDeltaEncoding::TestCompression()
{
array<array<Byte>^>^ byteArray = CDeltaEncoding::CompressArray(m_testdata);
array<int>^ decompressedArray = CDeltaEncoding::DecompressArray(byteArray);
int totalBytes = 0;
for (int i = 0; i<byteArray->Length; i++)
{
totalBytes += byteArray[i]->Length;
}
Assert::IsTrue(m_testdata->Length * sizeof(m_testdata) > totalBytes, "Expected the total bytes to be less than the original array!!");
//Expected totalBytes = 53
}
This smells a lot like homework to me. The crucial phrase is: "Using delta encoding."
Delta encoding means you encode the delta (difference) between each number and the next:
67497, 67376, 67173, 67235, 67057, 67031, 66951, 66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044, 67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620, 66579, 66596, 66713, 66852, 66715
would turn into:
[Base: 67497]: -121, -203, +62
and so on. Assuming 8-bit bytes, the original numbers require 3 bytes apiece (and given the number of compilers with 3-byte integer types, you're normally going to end up with 4 bytes apiece). From the looks of things, the differences will fit quite easily in 2 bytes apiece, and if you can ignore one (or possibly two) of the least significant bits, you can fit them in one byte apiece.
Delta encoding is most often used for things like sound encoding where you can "fudge" the accuracy at times without major problems. For example, if you have a change from one sample to the next that's larger than you've left space to encode, you can encode a maximum change in the current difference, and add the difference to the next delta (and if you don't mind some back-tracking, you can distribute some to the previous delta as well). This will act as a low-pass filter, limiting the gradient between samples.
For example, in the series you gave, a simple delta encoding requires ten bits to represent all the differences. By dropping the LSB, however, nearly all the samples (all but one, in fact) can be encoded in 8 bits. That one has a difference (right shifted one bit) of -173, so if we represent it as -128, we have 45 left. We can distribute that error evenly between the preceding and following sample. In that case, the output won't be an exact match for the input, but if we're talking about something like sound, the difference probably won't be particularly obvious.
I did mention that it was an exercise that I had to complete and the solution that I received was deemed not good enough, so I wanted some constructive feedback seeing as actual companies never decide to tell you what you did wrong.
When the array is compressed I store the differences and not the original values except the first as this was my understanding. If you had looked at my code I have provided a full solution but my question was how bad was it?