I want to divide an AVX2 vector by a constant. I have visited this question and many other pages. I saw something that might help (fixed-point arithmetic), but I didn't understand it. The problem is that this division is the bottleneck. I tried two ways:
First, casting to float and doing the operation with AVX instructions:
//outside the bottleneck:
__m256i veci16; // containing some integer numbers (16x16-bit numbers)
__m256 div_v = _mm256_set1_ps(div);
__m256 vecps;
//inside the bottleneck
//some calculations which make veci16
vecps = _mm256_castsi256_ps (veci16);
vecps = _mm256_div_ps (vecps, div_v);
veci16 = _mm256_castps_si256 (vecps);
_mm256_storeu_si256((__m256i *)&output[i][j], veci16);
With the first method, the problem is that without the division the elapsed time is 5 ns, and with it the elapsed time is about 60 ns.
Second, I stored the vector to an array, divided element by element, and loaded it back, like this:
alignas(32) int16_t t[16]; // 16-bit scratch, 32-byte aligned for the aligned store/load
inline __m256i _mm256_div_epi16 (__m256i a , int b){
    _mm256_store_si256((__m256i *)&t[0] , a);
    t[0]/=b; t[1]/=b; t[2]/=b; t[3]/=b; t[4]/=b; t[5]/=b; t[6]/=b; t[7]/=b;
    t[8]/=b; t[9]/=b; t[10]/=b; t[11]/=b; t[12]/=b; t[13]/=b; t[14]/=b; t[15]/=b;
    return _mm256_load_si256((__m256i *)&t[0]);
}
Well, it was better, but the elapsed time is still 17 ns. The surrounding calculations are too extensive to show here.
The question is: Is there any faster way to optimize this inline function?
You can do this with _mm256_mulhrs_epi16. It does a fixed-point multiply with rounding, so you just set the multiplier vector to 32768 / b:
inline __m256i _mm256_div_epi16 (const __m256i va, const int b)
{
__m256i vb = _mm256_set1_epi16(32768 / b);
return _mm256_mulhrs_epi16(va, vb);
}
Note that this assumes b > 1; for b = 1, the factor 32768 does not fit in a signed 16-bit element.
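For intuition, _mm256_mulhrs_epi16 computes round((a * vb) / 32768) in each 16-bit lane, so with vb = 32768 / b the result approximates a / b with round-to-nearest behavior. A quick usage sketch (the input value 1000 and divisor 7 are just illustrative):
__m256i v = _mm256_set1_epi16(1000);
__m256i q = _mm256_div_epi16(v, 7);
// each lane of q is 143: 1000/7 = 142.857..., rounded to nearest
// (unlike C integer division, which would truncate to 142)
Also, since 32768 / b is truncated, expect small rounding differences for divisors that do not divide 32768 exactly.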
Imagine this piece of code:
void Function(int16_t *src, int *indices, float *dst, int cnt, float mul)
{
    for (int i = 0; i < cnt; i++) dst[i] = float(src[indices[i]]) * mul;
}
This really asks for gather intrinsics, e.g. _mm_i32gather_epi32. I have had great success with these when loading floats, but are there any for 16-bit ints? Another problem is that I need to widen from 16 bits on the input to 32 bits (float) on the output.
There is indeed no instruction to gather 16-bit integers, but (assuming there is no risk of a memory-access violation) you can just load 32-bit integers starting at the corresponding addresses and mask out the upper half of each value.
For uint16_t this would be a simple bit-AND; for signed integers you can instead shift the values to the left so that the sign bit ends up in the most-significant position. You can then (arithmetically) shift the values back before converting them to float, or, since you multiply them anyway, just scale the multiplication factor accordingly.
Alternatively, you could load from two bytes earlier and arithmetically shift to the right (a sketch of this variant follows the code below). Either way, your bottleneck will likely be the load ports: vpgatherdd requires 8 load uops, and together with the load for the indices you have 9 loads distributed over two ports, which should result in about 4.5 cycles for 8 elements.
Untested possible AVX2 implementation (it does not handle the last elements; if cnt is not a multiple of 8, just run your original loop for the remainder):
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
__m256 mul = _mm256_set1_ps(mul_*float(1.0f/0x10000));
for (size_t i=0; i+8<=cnt; i+=8){ // todo handle last elements
// load indices:
__m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
// load 16bit integers in the lower halves + garbage in the upper halves:
__m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
// shift each value to upper half (removes garbage, makes sure sign is at the right place)
// values are too large by a factor of 0x10000
values = _mm256_slli_epi32(values, 16);
// convert to float, scale and multiply:
__m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
// store result
_mm256_storeu_ps(dst + i, fvalues);
}
}
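For the load-two-bytes-earlier variant mentioned above, the changed middle of the loop could look like this (an untested sketch; for index 0 it reads the two bytes just before src, so make sure that access is legal):
// gather 32-bit words whose upper 16 bits hold the wanted values
__m256i values = _mm256_i32gather_epi32(
    reinterpret_cast<int const*>(reinterpret_cast<char const*>(src) - 2), idx, 2);
// arithmetic right shift sign-extends each value down into place
values = _mm256_srai_epi32(values, 16);
// values are now exact, so multiply by the unscaled factor
__m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), _mm256_set1_ps(mul_));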
Porting this to AVX-512 should be straightforward.
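For example, a rough AVX-512F version of the loop (untested; note that _mm512_i32gather_epi32 takes the index vector as its first argument, unlike its AVX2 counterpart):
__m512 mul = _mm512_set1_ps(mul_ * float(1.0 / 0x10000));
for (size_t i = 0; i + 16 <= cnt; i += 16) { // todo handle last elements
    __m512i idx = _mm512_loadu_si512(indices + i);
    // gather 16 values; the 16-bit data sits in the lower half of each 32-bit word
    __m512i values = _mm512_i32gather_epi32(idx, reinterpret_cast<int const*>(src), 2);
    values = _mm512_slli_epi32(values, 16); // move to the upper half, fixing the sign
    __m512 fvalues = _mm512_mul_ps(_mm512_cvtepi32_ps(values), mul);
    _mm512_storeu_ps(dst + i, fvalues);
}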
"...It is very possible for O(N) code to run faster than O(1) code for specific inputs. Big O just describes the rate of increase."
According to my understanding:
O(N) - Time taken for an algorithm to run based on the varying values of input N.
O(1) - Constant time taken for the algorithm to execute irrespective of the size of the input e.g. int val = arr[10000];
Can someone help me understand based on the author's statement?
How can O(N) code run faster than O(1) code?
What are the specific inputs the author is alluding to?
Rate of increase of what?
O(n) linear-time code can absolutely be faster than O(1) constant-time code. The reason is that constant factors are ignored entirely in Big O, which is a measure of how fast an algorithm's running time increases as the input size n increases, and nothing else. It's a measure of growth rate, not running time.
Here's an example:
int constant(int[] arr) {
int total = 0;
for (int i = 0; i < 10000; i++) {
total += arr[0];
}
return total;
}
int linear(int[] arr) {
int total = 0;
for (int i = 0; i < arr.length; i++) {
total += arr[i];
}
return total;
}
In this case, constant does a lot of work, but it's fixed work that will always be the same regardless of how large arr is. linear, on the other hand, appears to have few operations, but those operations are dependent on the size of arr.
In other words, as arr increases in length, constant's performance stays the same, but linear's running time increases linearly in proportion to its argument array's size.
Call the two functions with a single-item array like
constant(new int[] {1});
linear(new int[] {1});
and it's clear that constant runs slower than linear.
But call them like:
int[] arr = new int[10000000];
constant(arr);
linear(arr);
Which runs slower?
After you've thought about it, run the code given various inputs of n and compare the results.
Just to show that this phenomenon of run time != Big O isn't just for constant-time functions, consider:
void exponential(int n) throws InterruptedException {
for (int i = 0; i < Math.pow(2, n); i++) {
Thread.sleep(1);
}
}
void linear(int n) throws InterruptedException {
for (int i = 0; i < n; i++) {
Thread.sleep(10);
}
}
Exercise (using pen and paper): up to which n does exponential run faster than linear?
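(To set it up: exponential sleeps 2^n times for 1 ms each, while linear sleeps n times for 10 ms each, so you are looking for the largest n with 2^n < 10n.)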
Consider the following scenario:
Op1) Given an array of length n where n>=10, print the first ten elements twice on the console. --> This is a constant time (O(1)) operation, because for any array of size>=10, it will execute 20 steps.
Op2) Given an array of length n where n>=10, find the largest element in the array. --> This is a linear time (O(N)) operation, because for any array, it will execute N steps.
Now if the array size is between 10 and 20 (exclusive), Op1 will be slower than Op2. But if we take an array of size > 20 (e.g., size = 1000), Op1 will still take 20 steps to complete, while Op2 will take 1000 steps.
That's why big-O notation is about the growth (rate of increase) of an algorithm's complexity, not its absolute running time.
I have two NSDecimalNumbers, and I need to raise one to the power of the other. Originally this code used doubles, and I could compute the result with the pow() function like this:
double result = pow(value1, value2);
The problem is that I am converting the code to use NSDecimalNumbers, and although they include the method toThePowerOf, it only accepts int values. At the moment the only solution I have is to convert the NSDecimalNumbers temporarily, but this results in a loss of precision.
double value1 = [decimal1 doubleValue];
double value2 = [decimal2 doubleValue];
double result = pow(value1, value2);
NSDecimalNumber *decimalResult = [[NSDecimalNumber alloc] initWithDouble:result];
Is there a way I can do this computation with NSDecimalNumbers and not lose the precision?
I need this to work with non-integer values, for example:
value1 = 1.06
value2 = 0.0277777777
As Joe points out, if you want to do this for positive integer powers, you can use NSDecimalPower() on an NSDecimal struct derived from your NSDecimalNumber (I personally prefer working with the structs, for performance reasons).
For the more general case of working with negative integers and fractional values, I have some code that I've modified from Dave DeLong's DDMathParser library. He has since removed the NSDecimal portion of this library, but you can find the last commit for this support. I extended Dave's exponential support into the following function:
extern NSDecimal DDDecimalPower(NSDecimal d, NSDecimal power) {
NSDecimal r = DDDecimalOne();
NSDecimal zero = DDDecimalZero();
NSComparisonResult compareToZero = NSDecimalCompare(&zero, &power);
if (compareToZero == NSOrderedSame) {
return r;
}
if (DDDecimalIsInteger(power))
{
if (compareToZero == NSOrderedAscending)
{
// we can only use the NSDecimal function for positive integers
NSUInteger p = DDUIntegerFromDecimal(power);
NSDecimalPower(&r, &d, p, NSRoundBankers);
}
else
{
// For negative integers, we take the inverse of the positive power
NSUInteger p = DDUIntegerFromDecimal(power);
p = -p;
NSDecimalPower(&r, &d, p, NSRoundBankers);
r = DDDecimalInverse(r);
}
} else {
// Check whether this is the inverse of an integer
NSDecimal inversePower = DDDecimalInverse(power);
NSDecimalRound(&inversePower, &inversePower, 34, NSRoundBankers); // Round to 34 digits to deal with cases like 1/3
if (DDDecimalIsInteger(inversePower))
{
r = DDDecimalNthRoot(d, inversePower);
}
else
{
double base = DDDoubleFromDecimal(d);
double p = DDDoubleFromDecimal(power);
double result = pow(base, p);
r = DDDecimalFromDouble(result);
}
}
return r;
}
This runs exact calculations on positive integer powers, negative integer powers, and fractional powers that map directly to roots. It still falls back on floating point calculations for fractional powers that don't cleanly fall into one of those bins, though.
Unfortunately, this requires a few of his other supporting functions to work. Therefore, I've uploaded my enhanced versions of his _DDDecimalFunctions.h and _DDDecimalFunctions.m that provide this functionality. They also include NSDecimal trigonometry, logarithm, and a few other functions. There are currently some issues with convergence on the tangent implementation, which is why I haven't finished a public post about this.
I came across the same problem recently and developed my own function to do exactly this. The function will calculate any base raised to any power as long as it yields a real answer; if it determines that a real answer cannot be calculated, it returns NSDecimalNumber.notANumber.
I have posted my solution as an answer to the same question that I asked, so here is the link.
In Xcode / Objective-C for the iPhone:
I have a float with the value 0.00004876544. How would I get it to display with two decimal places after the first significant digit?
For example, 0.00004876544 would read 0.000049.
I didn't run this through a compiler to double-check it, but here's the basic gist of the algorithm (converted from the answer to this question):
-(float) round:(float)num toSignificantFigures:(int)n {
    if (num == 0) {
        return 0;
    }
    // number of digits to the left of the decimal point (negative for values < 1)
    double d = ceil(log10(num < 0 ? -num : num));
    // how far to shift so that n significant digits end up left of the decimal point
    int power = n - (int) d;
    double magnitude = pow(10, power);
    long shifted = round(num * magnitude); // round in the shifted domain
    return shifted / magnitude;            // shift back
}
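Tracing the question's value through this: for num = 0.00004876544 and n = 2, log10(num) ≈ -4.31, so d = -4 and power = 2 - (-4) = 6. Then magnitude = 10^6, shifted = round(48.76544) = 49, and the result is 49 / 10^6 = 0.000049, as desired.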
The important thing to remember is that Objective-C is a superset of C, so anything that is valid in C is also valid in Objective-C. This method uses C functions defined in math.h.
I want to vectorize code for Core 2. I think I can use intrinsic functions from gcc or icc, and the SSE, SSE2, SSE3, and SSSE3 instruction sets are allowed.
My code works on arrays of 8 uint32_t elements, and it looks like this (only the hotspot is shown):
const uint32_t p[8] = {2147483743, 2147483713, 2147483693, 2147483659,
2147483647, 2147483629, 2147483587, 2147483579};
void vector_mod_add(uint32_t *a /* a[8] */, uint32_t *b /* b[8] */) {
int n;
for(n=0;n<8;n++)
a[n]+=b[n];
for(n=0;n<8;n++)
if(a[n]>=p[n])
a[n]-=p[n];
}
Addition is rather easy, but I don't know how to do a conditional subtraction.
Also, I have no experience with manual vectorization using SSE2, so please tell me how I should define all the types here.
You can write it as a[n] -= p[n] & ~(a[n] < p[n]). Note that the < here is not the C one: it is the SSE comparison (_mm_cmplt_epi32, which compiles to pcmpgtd with the operands swapped), returning -1 in each true element and 0 in each false element so that the result can be used as an AND mask; the & ~ maps to pandn. Here is an attempt at the code:
__m128i a, p;
a = _mm_sub_epi32(a, _mm_andnot_si128(_mm_cmplt_epi32(a, p), p));
Note that this uses signed operations, and so your numbers will need to stay below 2^31 - 1 for it to work correctly. If you need to go beyond that, change _mm_cmplt_epi32(a, p) to _mm_cmplt_epi32(_mm_xor_si128(a, signs), _mm_xor_si128(p, signs)), where signs is a vector of 32-bit words whose elements are all 0x80000000. Here is a version that seems like it will handle wider ranges more efficiently:
__m128i a, p;
a = _mm_sub_epi32(a, p);
a = _mm_add_epi32(a, _mm_and_si128(_mm_srai_epi32(a, 31), p));
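To answer the type-definition part of the question, here is the second variant wrapped in a complete function (a minimal, untested sketch; it assumes the global p array from the question and uses unaligned loads/stores so the arrays need no special alignment):
#include <emmintrin.h>
#include <stdint.h>

void vector_mod_add(uint32_t *a /* a[8] */, const uint32_t *b /* b[8] */) {
    int n;
    for (n = 0; n < 8; n += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + n));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + n));
        __m128i vp = _mm_loadu_si128((const __m128i *)(p + n));
        va = _mm_add_epi32(va, vb); // a += b
        va = _mm_sub_epi32(va, vp); // a -= p (may wrap below zero)
        // where the subtraction wrapped, the sign bit is set; the arithmetic
        // shift turns it into an all-ones mask that adds p back in those lanes
        va = _mm_add_epi32(va, _mm_and_si128(_mm_srai_epi32(va, 31), vp));
        _mm_storeu_si128((__m128i *)(a + n), va);
    }
}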