ARM: saturate signed int to unsigned byte - optimization

The saturating instructions saturate unsigned to unsigned or signed to signed. What's the best way to saturate signed 16-bit ints to an unsigned byte?
In short, here's the logic:
uint8_t usat8(uint8_t u8, int16_t s16)
{
    s16 += u8;
    if (s16 <= 0) {
        return 0;
    } else if (s16 >= 255) {
        return 255;
    } else {
        return (uint8_t)s16;
    }
}
void add_row(uint8_t *dst, uint8_t *u8, int16_t *s16)
{
    for (int i = 0; i < XXX; ++i)
    {
        dst[i] = usat8(u8[i], s16[i]);
    }
}
Values of s16 are usually not far outside the [0, 255] range, e.g. it's safe to assume that abs(s16[x]) < 1000.
EDIT: I just realized that USAT16 actually saturates a signed 16-bit int to an unsigned range. A simple USAT16 is the solution to the problem.
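For reference, the per-element saturation can be expressed with a single USAT (which clamps a signed value to an unsigned n-bit range). A minimal sketch using GCC-style inline asm, assuming a GCC-compatible compiler targeting ARMv6; usat8_add is a made-up helper name:
#include <stdint.h>

/* Clamp u8 + s16 to [0, 255] with USAT (sketch, GCC-style inline asm). */
static inline uint8_t usat8_add(uint8_t u8, int16_t s16)
{
    int32_t sum = (int32_t)u8 + s16;                    /* plain 32-bit add */
    uint32_t out;
    __asm__("usat %0, #8, %1" : "=r"(out) : "r"(sum));  /* saturate to unsigned 8-bit */
    return (uint8_t)out;
}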

After 5 minutes of thinking I have this idea (pseudo ARM asm):
sadd16 sum, s16, u8 # do two additions in parallel
orr signs, 0x1001, sum, lsr #15 # extract signs of the two 16-bit results
usat16 sum, sum, #8 # saturate both of the 16-bit sums to the unsigned 8-bit range
uadd16 sum, sum, signs
This way, if the sign bit was set for either of the sums, the resulting sum becomes 256, or 0x100. When writing back the data, the shifted-out 0x1 will be discarded.
Any comments? Does that seem like the optimal approach, or is there a better alternative?
PS: This is for an ARMv6 device, so no NEON or ARMv6T2.

Related

Why is the bitfield's least significant bit promoted to the MSb during typecasting in the program below?

Why do we get this value as output: ffffffff?
#include <stdio.h>
struct bitfield {
    signed char bitflag:1;
};
int main()
{
    unsigned char i = 1;
    struct bitfield *var = (struct bitfield*)&i;
    printf("\n %x \n", var->bitflag);
    return 0;
}
I know that in a signed data type the most significant bit represents the sign: 0 for positive, 1 for negative. But I still can't figure out why -1 (ffffffff) is printed. With only one bit set in the struct, I was expecting it to be promoted to a 1-byte char; since my machine is little-endian, I expected that single bit in the field to be interpreted as the LSb of my 1-byte character.
Can someone please explain? I'm really confused.
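For what it's worth, here is a minimal, hypothetical demo (separate from the type-punned code above; onebit is a made-up name) of the effect in question: on typical two's-complement implementations a 1-bit signed bitfield can only represent 0 and -1, so storing 1 and reading it back yields -1, which integer promotion sign-extends before printing:
#include <stdio.h>

struct onebit {
    signed char bitflag:1;    /* 1-bit signed field: representable values are 0 and -1 */
};

int main(void)
{
    struct onebit b;
    b.bitflag = 1;            /* 1 is not representable; typical implementations store -1 */
    int promoted = b.bitflag; /* integer promotion sign-extends the -1 */
    printf("%d %x\n", promoted, (unsigned)promoted);  /* typically prints: -1 ffffffff */
    return 0;
}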

Gather AVX2&512 intrinsic for 16-bit integers?

Imagine this piece of code:
void Function(int16 *src, int *indices, float *dst, int cnt, float mul)
{
    for (int i = 0; i < cnt; i++) dst[i] = float(src[indices[i]]) * mul;
};
This really calls for gather intrinsics, e.g. _mm_i32gather_epi32. I had great success with these when loading floats, but are there any for 16-bit ints? Another problem here is that I need to go from 16 bits on the input to 32 bits (float) on the output.
There is indeed no instruction to gather 16-bit integers, but (assuming there is no risk of a memory-access violation) you can just load 32-bit integers starting at the corresponding addresses and mask out the upper halves of each value.
For uint16_t this would be a simple bit-and; for signed integers you can shift the values to the left in order to have the sign bit at the most significant position. You can then (arithmetically) shift the values back before converting them to float, or, since you multiply them anyway, just scale the multiplication factor accordingly.
Alternatively, you could load from two bytes earlier and arithmetically shift to the right. Either way, your bottleneck will likely be the load ports (vpgatherdd requires 8 load uops; together with the load for the indices you have 9 loads distributed over two ports, which should result in 4.5 cycles for 8 elements).
Untested, possible AVX2 implementation (it does not handle the last elements; if cnt is not a multiple of 8, just execute your original loop at the end):
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    __m256 mul = _mm256_set1_ps(mul_*float(1.0f/0x10000));
    for (size_t i=0; i+8<=cnt; i+=8){ // todo: handle last elements
        // load indices:
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
        // load 16-bit integers in the lower halves + garbage in the upper halves:
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
        // shift each value to the upper half (removes garbage, makes sure the sign is at the right place)
        // values are too large by a factor of 0x10000
        values = _mm256_slli_epi32(values, 16);
        // convert to float, scale and multiply:
        __m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
        // store result
        _mm256_storeu_ps(dst + i, fvalues);
    }
}
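For the uint16_t case mentioned above (a plain bit-and instead of the shift, so no compensation of the multiplier is needed), an equally untested sketch; FunctionU16 is a made-up name:
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void FunctionU16(uint16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    __m256 mul = _mm256_set1_ps(mul_);
    for (size_t i = 0; i + 8 <= cnt; i += 8){ // todo: handle last elements
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
        // load 16-bit integers in the lower halves + garbage in the upper halves:
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
        // mask out the garbage upper halves; values are now exact, no rescaling needed:
        values = _mm256_and_si256(values, _mm256_set1_epi32(0xFFFF));
        __m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
        _mm256_storeu_ps(dst + i, fvalues);
    }
}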
Porting this to AVX-512 should be straight-forward.
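For illustration, one possible (equally untested) AVX-512 sketch of the same loop, processing 16 elements per iteration; Function512 is a made-up name, and note that the AVX-512 gather intrinsic takes the index vector as its first argument:
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void Function512(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    __m512 mul = _mm512_set1_ps(mul_*float(1.0f/0x10000));
    for (size_t i = 0; i + 16 <= cnt; i += 16){ // todo: handle last elements
        __m512i idx = _mm512_loadu_si512(indices + i);
        // gather 32-bit words starting at each int16_t (vindex comes first here):
        __m512i values = _mm512_i32gather_epi32(idx, src, 2);
        // shift each value to the upper half, then convert, scale back and multiply:
        values = _mm512_slli_epi32(values, 16);
        __m512 fvalues = _mm512_mul_ps(_mm512_cvtepi32_ps(values), mul);
        _mm512_storeu_ps(dst + i, fvalues);
    }
}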

Implementing a side channel timing attack

I'm working on a project implementing a side-channel timing attack in C on HMAC. I've done so by computing the hex-encoded tag and brute-forcing it byte by byte, taking advantage of strcmp's early-exit timing behaviour. So for every digit in my test tag, I calculate the amount of time it takes for every hex char to verify. I take the hex char that corresponds to the highest amount of time calculated, infer that it is the correct char in the tag, and move on to the next byte. However, strcmp's timing is very unpredictable. Although it is easy to see the timing difference between comparing two equal strings and two totally different strings, I'm having difficulty finding the char that takes the most time to compare when every other string I'm comparing against is very similar (differing by only one byte).
The changeByte method below takes customTag, which is the tag that has been computed up to that point, and attempts to find the correct byte corresponding to index. changeByte is called n times, where n = length of the tag. hexTag is a global variable holding the correct tag. timeCompleted stores the average time taken to compute the testTag for each of the hex characters at a given char position. Any help would be appreciated; thank you for your time.
// Checks if the index of the given byte is correct or not
void changeByte(unsigned char *k, unsigned char *m, unsigned char *algorithm, unsigned char *customTag, int index)
{
    long iterations = 50000;
    // used for every byte sequence to test the timing
    unsigned char *tempTag = (unsigned char *)(malloc(sizeof(unsigned char) * (strlen(customTag) + 1)));
    sprintf(tempTag, "%s", customTag);
    int timeIndex = 0;
    // stores the time completed for every respective ascii char
    double *timeCompleted = (double *)(malloc(sizeof(double) * 16));
    // iterates through hex chars 0-9, a-f
    for (int i = 48; i <= 102; i++){
        if (i >= 58 && i <= 96) continue;
        double total = 0;
        for (long j = 0; j < iterations; j++){
            // calculates the time it takes to complete for every char in that position
            tempTag[index] = (unsigned char)i;
            struct rusage usage;
            struct timeval start, end;
            getrusage(RUSAGE_SELF, &usage);
            start = usage.ru_stime;
            for (int k = 0; k < 50000; k++) externalStrcmp(tempTag, hexTag); // this is just calling strcmp in another file
            getrusage(RUSAGE_SELF, &usage);
            end = usage.ru_stime;
            double startTime = ((double)start.tv_sec + (double)start.tv_usec) / 10000;
            double endTime = ((double)end.tv_sec + (double)end.tv_usec) / 10000;
            total += endTime - startTime;
        }
        double val = total / iterations;
        timeCompleted[timeIndex] = val;
        timeIndex++;
    }
    // sets next char equal to the hex char corresponding to the index
    customTag[index] = getCorrectChar(timeCompleted);
    free(timeCompleted);
    free(tempTag);
}
// finds the highest time. The hex char corresponding with the highest time it took the
// verify function to complete is the correct one
unsigned char getCorrectChar(double *timeCompleted)
{
    double high = -1;
    int index = 0;
    for (int i = 0; i < 16; i++){
        if (timeCompleted[i] > high){
            high = timeCompleted[i];
            index = i;
        }
    }
    return (index + 48) <= 57 ? (unsigned char)(index + 48) : (unsigned char)(index + 87);
}
I'm not sure if it's the main problem, but you add seconds to microseconds directly, as though 1 us == 1 s. That will give wrong results whenever the number of seconds in startTime and endTime differs.
Also, the scaling factor between usec and sec is 1 000 000 (thx zaph). So this should work better:
double startTime=(double)start.tv_sec + (double)start.tv_usec/1000000;
double endTime=(double)end.tv_sec + (double)end.tv_usec/1000000;
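Along the same lines, the subtraction can also be done entirely in integer microseconds, which sidesteps the unit mix-up; a small sketch (elapsed_usec is a made-up helper):
#include <sys/time.h>

/* Elapsed time between two struct timeval values, in microseconds. */
static long long elapsed_usec(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}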

Are 2^n exponent calculations really less efficient than bit-shifts?

If I do:
int x = 4;
pow(2, x);
Is that really that much less efficient than just doing:
1 << 4
?
Yes. An easy way to show this is to compile the following two functions that do the same thing and then look at the disassembly.
#include <stdint.h>
#include <math.h>
uint32_t foo1(uint32_t shftAmt) {
    return pow(2, shftAmt);
}
uint32_t foo2(uint32_t shftAmt) {
    return (1 << shftAmt);
}
cc -arch armv7 -O3 -S -o - shift.c (I happen to find ARM asm easier to read but if you want x86 just remove the arch flag)
_foo1:
# BB#0:
push {r7, lr}
vmov s0, r0
mov r7, sp
vcvt.f64.u32 d16, s0
vmov r0, r1, d16
blx _exp2
vmov d16, r0, r1
vcvt.u32.f64 s0, d16
vmov r0, s0
pop {r7, pc}
_foo2:
# BB#0:
movs r1, #1
lsl.w r0, r1, r0
bx lr
You can see foo2 only takes 2 instructions vs foo1, which takes several instructions. It has to move the data to the FP HW registers (vmov), convert the integer to a double (vcvt.f64.u32), call the exp2 function, then convert the answer back to a uint (vcvt.u32.f64) and move it from the FP HW back to the GP registers.
Yes. Though by how much I can't say. The easiest way to determine that is to benchmark it.
The pow function uses doubles... At least, if it conforms to the C standard. Even if that function used bitshift when it sees a base of 2, there would still be testing and branching to reach that conclusion, by which time your simple bitshift would be completed. And we haven't even considered the overhead of a function call yet.
For equivalency, I assume you meant to use 1 << x instead of 1 << 4.
Perhaps a compiler could optimize both of these, but it's far less likely to optimize a call to pow. If you need the fastest way to compute a power of 2, do it with shifting.
Update... Since I mentioned it's easy to benchmark, I decided to do just that. I happen to have Windows and Visual C++ handy so I used that. Results will vary. My program:
#include <Windows.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>

LARGE_INTEGER liFreq, liStart, liStop;

inline void StartTimer()
{
    QueryPerformanceCounter(&liStart);
}

inline double ReportTimer()
{
    QueryPerformanceCounter(&liStop);
    double milli = 1000.0 * double(liStop.QuadPart - liStart.QuadPart) / double(liFreq.QuadPart);
    printf( "%.3f ms\n", milli );
    return milli;
}

int main()
{
    QueryPerformanceFrequency(&liFreq);
    const size_t nTests = 10000000;
    int x = 4;
    int sumPow = 0;
    int sumShift = 0;
    double powTime, shiftTime;

    // Make an array of random exponents to use in tests.
    const size_t nExp = 10000;
    int e[nExp];
    srand( (unsigned int)time(NULL) );
    for( int i = 0; i < nExp; i++ ) e[i] = rand() % 31;

    // Test power.
    StartTimer();
    for( size_t i = 0; i < nTests; i++ )
    {
        int y = (int)pow(2, (double)e[i%nExp]);
        sumPow += y;
    }
    powTime = ReportTimer();

    // Test shifting.
    StartTimer();
    for( size_t i = 0; i < nTests; i++ )
    {
        int y = 1 << e[i%nExp];
        sumShift += y;
    }
    shiftTime = ReportTimer();

    // The compiler shouldn't optimize out our loops if we need to display a result.
    printf( "Sum power: %d\n", sumPow );
    printf( "Sum shift: %d\n", sumShift );
    printf( "Time ratio of pow versus shift: %.2f\n", powTime / shiftTime );

    system("pause");
    return 0;
}
My output:
379.466 ms
15.862 ms
Sum power: 157650768
Sum shift: 157650768
Time ratio of pow versus shift: 23.92
That depends on the compiler, but in general (when the compiler is not totally braindead) yes: the shift is one CPU instruction, while the other is a function call, which involves saving the current state and setting up a stack frame, and that requires many instructions.
Generally yes, as a bit shift is a very basic operation for the processor.
On the other hand, many compilers optimise the code so that raising to a power is in fact just a bit shift.

Functions to compress and uncompress array of integers

I was recently asked to complete a task for a C++ role; however, as the application was not progressed any further, I thought I would post here for some feedback / advice / improvements / a reminder of concepts I've forgotten.
The task was:
The following data is a time series of integer values
int timeseries[32] = {67497, 67376, 67173, 67235, 67057, 67031, 66951,
66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044,
67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620,
66579, 66596, 66713, 66852, 66715};
The series might be, for example, the closing price of a stock each day
over a 32 day period.
As stored above, the data will occupy 32 x sizeof(int) bytes = 128 bytes
assuming 4 byte ints.
Using delta encoding, write a function to compress, and a function to uncompress, data like the above.
Ok, so before this point I had never looked at compression, so my solution is far from perfect. I approached the problem by compressing the array of integers into an array of bytes. When representing an integer as bytes I calculate the most significant byte (msb) and keep everything up to that point, throwing the rest away. This is then added to the byte array. For negative values I increment the msb count by 1 so that we can differentiate between positive and negative values when decoding, by keeping the leading 1 bits.
When decoding I parse this jagged byte array and simply reverse the actions performed when compressing. As mentioned, I had never looked at compression prior to this task, so I came up with my own method to compress the data. I had been looking at C++/CLI recently and had not really used it before, so I just decided to write it in this language, for no particular reason. Below is the class, with a unit test at the very bottom. Any advice / improvements / enhancements will be much appreciated.
Thanks.
array<array<Byte>^>^ CDeltaEncoding::CompressArray(array<int>^ data)
{
    int temp = 0;
    int original;
    int size = 0;
    array<int>^ tempData = gcnew array<int>(data->Length);
    data->CopyTo(tempData, 0);
    array<array<Byte>^>^ byteArray = gcnew array<array<Byte>^>(tempData->Length);
    for (int i = 0; i < tempData->Length; ++i)
    {
        original = tempData[i];
        tempData[i] -= temp;
        temp = original;
        int msb = GetMostSignificantByte(tempData[i]);
        byteArray[i] = gcnew array<Byte>(msb);
        System::Buffer::BlockCopy(BitConverter::GetBytes(tempData[i]), 0, byteArray[i], 0, msb);
        size += byteArray[i]->Length;
    }
    return byteArray;
}

array<int>^ CDeltaEncoding::DecompressArray(array<array<Byte>^>^ buffer)
{
    System::Collections::Generic::List<int>^ decodedArray = gcnew System::Collections::Generic::List<int>();
    int temp = 0;
    for (int i = 0; i < buffer->Length; ++i)
    {
        int retrievedVal = GetValueAsInteger(buffer[i]);
        decodedArray->Add(retrievedVal);
        decodedArray[i] += temp;
        temp = decodedArray[i];
    }
    return decodedArray->ToArray();
}

int CDeltaEncoding::GetMostSignificantByte(int value)
{
    array<Byte>^ tempBuf = BitConverter::GetBytes(Math::Abs(value));
    int msb = tempBuf->Length;
    for (int i = tempBuf->Length - 1; i >= 0; --i)
    {
        if (tempBuf[i] != 0)
        {
            msb = i + 1;
            break;
        }
    }
    if (!IsPositiveInteger(value))
    {
        //We need an extra byte to differentiate the negative integers
        msb++;
    }
    return msb;
}

bool CDeltaEncoding::IsPositiveInteger(int value)
{
    return value / Math::Abs(value) == 1;
}

int CDeltaEncoding::GetValueAsInteger(array<Byte>^ buffer)
{
    array<Byte>^ tempBuf;
    if (buffer->Length % 2 == 0)
    {
        //With even integers there is no need to allocate a new byte array
        tempBuf = buffer;
    }
    else
    {
        tempBuf = gcnew array<Byte>(4);
        System::Buffer::BlockCopy(buffer, 0, tempBuf, 0, buffer->Length);
        unsigned int val = buffer[buffer->Length-1] &= 0xFF;
        if (val == 0xFF)
        {
            //We have a negative integer compressed into 3 bytes
            //Copy over this last byte as well so we keep the negative pattern
            System::Buffer::BlockCopy(buffer, buffer->Length-1, tempBuf, buffer->Length, 1);
        }
    }
    switch (tempBuf->Length)
    {
    case sizeof(short):
        return BitConverter::ToInt16(tempBuf, 0);
    case sizeof(int):
    default:
        return BitConverter::ToInt32(tempBuf, 0);
    }
}
And then in a test class I had:
void CTestDeltaEncoding::TestCompression()
{
    array<array<Byte>^>^ byteArray = CDeltaEncoding::CompressArray(m_testdata);
    array<int>^ decompressedArray = CDeltaEncoding::DecompressArray(byteArray);
    int totalBytes = 0;
    for (int i = 0; i < byteArray->Length; i++)
    {
        totalBytes += byteArray[i]->Length;
    }
    Assert::IsTrue(m_testdata->Length * sizeof(m_testdata) > totalBytes, "Expected the total bytes to be less than the original array!!");
    //Expected totalBytes = 53
}
This smells a lot like homework to me. The crucial phrase is: "Using delta encoding."
Delta encoding means you encode the delta (difference) between each number and the next:
67497, 67376, 67173, 67235, 67057, 67031, 66951, 66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044, 67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620, 66579, 66596, 66713, 66852, 66715
would turn into:
[Base: 67497]: -121, -203, +62
and so on. Assuming 8-bit bytes, the original numbers require 3 bytes apiece (and given the number of compilers with 3-byte integer types, you're normally going to end up with 4 bytes apiece). From the looks of things, the differences will fit quite easily in 2 bytes apiece, and if you can ignore one (or possibly two) of the least significant bits, you can fit them in one byte apiece.
Delta encoding is most often used for things like sound encoding where you can "fudge" the accuracy at times without major problems. For example, if you have a change from one sample to the next that's larger than you've left space to encode, you can encode a maximum change in the current difference, and add the difference to the next delta (and if you don't mind some back-tracking, you can distribute some to the previous delta as well). This will act as a low-pass filter, limiting the gradient between samples.
For example, in the series you gave, a simple delta encoding requires ten bits to represent all the differences. By dropping the LSB, however, nearly all the samples (all but one, in fact) can be encoded in 8 bits. That one has a difference (right shifted one bit) of -173, so if we represent it as -128, we have 45 left. We can distribute that error evenly between the preceding and following sample. In that case, the output won't be an exact match for the input, but if we're talking about something like sound, the difference probably won't be particularly obvious.
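To make the scheme concrete, here is a minimal sketch of plain delta encoding in standard C++ (not C++/CLI), assuming the input is non-empty and every difference fits in an int16_t, which holds for the sample series (4 + 31 × 2 = 66 bytes instead of 128). DeltaCompressed, compress and decompress are just illustrative names:
#include <cstdint>
#include <cstddef>
#include <vector>

// First value stored in full, every later value as a 16-bit difference.
struct DeltaCompressed {
    int32_t base;                 // first value of the series
    std::vector<int16_t> deltas;  // differences between consecutive values
};

DeltaCompressed compress(const std::vector<int32_t>& data)   // assumes !data.empty()
{
    DeltaCompressed out{data[0], {}};
    for (std::size_t i = 1; i < data.size(); ++i)
        out.deltas.push_back(static_cast<int16_t>(data[i] - data[i - 1]));
    return out;
}

std::vector<int32_t> decompress(const DeltaCompressed& c)
{
    std::vector<int32_t> out;
    out.push_back(c.base);
    for (int16_t d : c.deltas)
        out.push_back(out.back() + d);   // re-accumulate the running sum
    return out;
}
Dropping to one signed byte per delta would roughly halve this again, but as noted above that needs the error-distribution (low-pass) trick, since not all of the differences fit in 8 bits.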
I did mention that it was an exercise I had to complete, and the solution I submitted was deemed not good enough, so I wanted some constructive feedback, seeing as actual companies never tell you what you did wrong.
When the array is compressed I store the differences and not the original values (except the first), as this was my understanding. If you look at my code, you'll see I have provided a full solution; my question was how bad it was.