Are there CPUs that do optimize integer divisions like compilers do?

As the title says, optimizing compilers transform division by a constant (non-power-of-2) into multiplication by a reciprocal ("magic constant") to avoid the expensive div instruction. I'm curious whether there is a CPU that performs this optimization as the code executes, caching the magic constants of frequently used divisors in a table and using them to compute the division whenever the divisor operand hits that cache.
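For reference, this is the kind of transformation I mean, as a minimal sketch (my own illustration, not output from any particular compiler): unsigned division by the constant 3 rewritten as a widening multiply by the round-up magic constant ceil(2^33 / 3) = 0xAAAAAAAB followed by a shift.

#include <cassert>
#include <cstdint>

// floor(x / 3) == floor(x * 0xAAAAAAAB / 2^33) for every 32-bit x.
static inline uint32_t div3(uint32_t x)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xAAAAAAABull) >> 33);
}

int main()
{
    for (uint64_t x = 0; x <= 0xFFFFFFFFull; x += 12345)  // spot check
        assert(div3(static_cast<uint32_t>(x)) == static_cast<uint32_t>(x) / 3);
}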
I ran an experiment on x64 (C++) and on an M1 (Rust, below), repeatedly dividing a number by the same divisor versus dividing by an ever-changing divisor in a loop. The two cases ended up performing identically, so I guess the answer is no, at least for the CPUs I tested on.
type div32 = i32;

#[inline(never)]
fn div_bench_fn(dividend: div32, mut divisor: div32, mut step: div32) -> f64 {
    step = std::hint::black_box(step);
    const LOOP_COUNT: u32 = 100_000_000;
    let start = std::time::Instant::now();
    let mut tmp = 0;
    for _ in 0..LOOP_COUNT {
        let res = dividend % divisor;
        divisor += step;
        tmp ^= res;
        std::hint::black_box(tmp);
    }
    let elapsed = start.elapsed().as_secs_f64();
    let div_sec = (LOOP_COUNT as f64) / elapsed;
    div_sec / 1_000_000.
}

fn main() {
    for _ in 0..10 {
        println!("--");
        println!(" static {:.2}m div/sec", div_bench_fn(0x12345678, 127, 0));
        println!(" dynamic {:.2}m div/sec", div_bench_fn(0x12345678, 127, 1));
    }
}
Looping code (ARMv8.5-A)
divbench[0x1000017e8] <+64>: cbz w9, 0x10000185c ; <+180> // panic if divisor == 0
divbench[0x1000017ec] <+68>: sdiv w13, w11, w9
divbench[0x1000017f0] <+72>: msub w13, w13, w9, w11
divbench[0x1000017f4] <+76>: add w9, w9, w19
divbench[0x1000017f8] <+80>: eor w8, w13, w8
divbench[0x1000017fc] <+84>: str w8, [sp, #0xc]
divbench[0x100001800] <+88>: subs w10, w10, #0x1
divbench[0x100001804] <+92>: b.ne 0x1000017e8 ; <+64>
Results (higher is better in this one)
--
static 1018.66m div/sec
dynamic 1568.52m div/sec
--
static 1574.76m div/sec
dynamic 1574.63m div/sec
--
static 1575.77m div/sec
dynamic 1574.38m div/sec
--
static 1577.74m div/sec
dynamic 1581.09m div/sec
--
static 1585.35m div/sec
dynamic 1591.46m div/sec
--
static 1572.47m div/sec
dynamic 1585.13m div/sec
--
static 1560.05m div/sec
dynamic 1571.18m div/sec
--
static 1575.05m div/sec
dynamic 1550.98m div/sec
--
static 1556.27m div/sec
dynamic 1571.85m div/sec
--
static 1581.93m div/sec
dynamic 1579.43m div/sec
I'm probably not the only one to have thought of this, but I haven't been able to find any information on it.

Related

Can we have dirty data on l1 cache in gpu?

I've read about some of the common write policies in GPU microarchitectures. For most GPUs the write policy is the one shown in the picture below (the picture is from the gpgpu-sim manual). Based on that picture I have a question: can we have dirty data in the L1 cache?
On some GPU architectures, the L1 is a write-back cache for global accesses. Note that this varies by GPU architecture, e.g. whether global activity is cached in L1 at all.
Speaking generally, then, yes you can have dirty data. By this I mean that the data in the L1 cache is modified (compared to what is otherwise in global space or the L2 cache) and it has not yet been "flushed" or updated into the L2 cache. (You can also have "stale" data - data in the L1 that has not been modified, but is not consistent with the L2.)
We can create a simple proof point for this (dirty data).
The following code, when executed on a cc7.0 device (and probably some other architectures as well), will not give the expected answer of 1024.
This is due to the fact that the L1, which is a separate entity per SM, is not immediately flushed to the L2. It therefore has "dirty data" by the above definition.
(The code is broken for this reason. Don't use this code. It's just a proof point.)
#include <iostream>
#include <cuda_runtime.h>

constexpr int num_blocks = 1024;
constexpr int num_threads = 32;

struct Lock {
    int *locked;

    Lock() {
        int init = 0;
        cudaMalloc(&locked, sizeof(int));
        cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
    }

    ~Lock() {
        if (locked) cudaFree(locked);
        locked = NULL;
    }

    __device__ __forceinline__ void acquire_lock() {
        while (atomicCAS(locked, 0, 1) != 0);
    }

    __device__ __forceinline__ void unlock() {
        atomicExch(locked, 0);
    }
};

__global__ void counter(Lock lock, int *total) {
    if (threadIdx.x == 1) {
        lock.acquire_lock();
        *total = *total + 1;
        // __threadfence(); uncomment this line to fix
        lock.unlock();
    }
}

int main() {
    int *total_dev;
    cudaMalloc(&total_dev, sizeof(int));
    int total_host = 0;
    cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
    {
        Lock lock;
        counter<<<num_blocks, num_threads>>>(lock, total_dev);
        cudaDeviceSynchronize();
        cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
        std::cout << total_host << std::endl;
    }
    cudaFree(total_dev);
}
In case there is any further doubt about whether this is a proper proof (e.g. to dispel arguments about things being "optimized into a register" etc.) we can study the resultant sass code. The end of the above kernel has code that looks like this:
/*0130*/ LDG.E.SYS R0, [R4] ;                        // load *total
/*0140*/ IADD3 R7, R0, 0x1, RZ ;                     // add 1
/*0150*/ STG.E.SYS [R4], R7 ;                        // store *total
/*0160*/ ATOMG.E.EXCH.STRONG.GPU PT, RZ, [R2], RZ ;  // lock.unlock
/*0170*/ EXIT ;
Since the result register has definitely been stored to the global space, we can infer that if another thread (in another SM) reads an unexpected value in global space for *total it must be due to the fact that the store from another SM has not reached the L2, i.e. has not reached device-wide consistency/coherency. Therefore the data in some other SM is "dirty". We can (presumably) rule out the "stale" case here (the data in the other L1 was written, but I have "old" data in my L1) because the global load indicated above does not happen until the lock is acquired in the SM.
Note that the above code "fails" on cc7.0 devices (and probably some other device architectures). It does not necessarily fail on the GPU you are using. But it is still "broken".

Debug data/neon performance hazards in arm neon code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON: a minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs best:
typedef unsigned (*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);

CalcMaxFunc CalcMaxFuncs[] =
{
    CalcMaxFunc_NEON_0,
    CalcMaxFunc_NEON_1,
    CalcMaxFunc_NEON_2,
    CalcMaxFunc_NEON_3,
    CalcMaxFunc_C_0
};

int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);
    // just random code to ensure that the cpu waits for the results
    // and the compiler doesn't optimize it away
    if (retI > 1000000)
        break;
    ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I swapped CalcMaxFunc_NEON_3 to be first, it would be 3-4 times slower, and according to the profiler it would stall at the last part of the function where it moves data from a NEON register to an ARM register.
So, what makes it stall sometimes and not other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between the calls to these functions in the loop, the unreliable behavior disappeared: now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in the actual code?
Update:
I tried to create a simple test function and then optimize it in stages to see how I could avoid the NEON->ARM stalls.
Here's the test runner function:
void NeonStallTest()
{
    int findMinErr(uint8_t* var1, uint8_t* var2, int size);
    srand(0);
    uint8_t var1[1280];
    uint8_t var2[1280];
    for (int i = 0; i < sizeof(var1); ++i)
    {
        var1[i] = rand();
        var2[i] = rand();
    }
#if 0 // early exit?
    for (int i = 0; i < 16; ++i)
        var1[i] = var2[i];
#endif
    int ret = 0;
    for (int i = 0; i < 10000000; ++i)
        ret += findMinErr(var1, var2, sizeof(var1));
    exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err = 0;
        for (int j = 0; j < 16; ++j)
        {
            int x = var1[j] - var2[j];
            err += x * x;
        }
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
        }
    }
    return ret;
}
Basically it computes the sum of squared differences between each pair of uint8_t[16] blocks and returns the index of the block pair with the lowest sum. I then rewrote it with NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err;
        uint8x8_t var1_0 = vld1_u8(var1 + 0);
        uint8x8_t var1_1 = vld1_u8(var1 + 8);
        uint8x8_t var2_0 = vld1_u8(var2 + 0);
        uint8x8_t var2_1 = vld1_u8(var2 + 8);
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
        uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
        uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
        err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
        uint32x4_t err0 = vpaddlq_u16(u0);
        uint32x4_t err1 = vpaddlq_u16(u1);
        err0 = vaddq_u32(err0, err1);
        uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
        err00 = vpadd_u32(err00, err00);
        err = vget_lane_u32(err00, 0);
#endif
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
#if 0 // enable early exit?
            if (ret_err == 0)
                break;
#endif
        }
    }
    return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually "unrolled" the loop by two, modified the code to use err0 and err1, and checked them only after performing the next round of compute. According to the profiler that gave some improvement: in the simple NEON loop roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err); after unrolling by two these operations took 25% (i.e. roughly a 10% overall speedup). Also, looking at the armv7 version, there are only 8 instructions between when err0 is set (vmov.32 r6, d16[0]) and when it is accessed (cmp r12, r6).
Note that in the code the early exit is ifdef'ed out; enabling it makes the function even slower. If I unroll by four, switch to four errN variables, and defer the check by two rounds, I still see vget_lane_u32 taking too much time in the profiler. Looking at the generated asm, the compiler defeats these attempts because it reuses some of the errN registers, which effectively makes the CPU consume the result of vget_lane_u32 much earlier than I want (I aim to delay that access by 10-20 instructions). Only when I unrolled by four and marked all four errN as volatile did vget_lane_u32 disappear from the profiler; however, the if (ret_err > errN) checks then became slow as hell, since those values presumably ended up as regular stack variables, and overall the four checks in the 4x manually unrolled loop started to take 40%. It looks like with proper hand-written asm it should be possible to make this work: have an early loop exit, avoid NEON->ARM stalls, and keep some ARM-side logic in the loop. However, the extra maintenance required for ARM asm makes that kind of code roughly 10x harder to maintain in a large project (that otherwise has no arm asm).
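To make the deferred-check idea concrete, here is a minimal software-pipelined sketch (my own hypothetical restructuring with made-up names such as findMinErr_NEON_pipelined, not the code discussed above, and the compiler is of course still free to reschedule it): block i's NEON work is issued before block i-1's result is moved back to an ARM register and compared.

#include <arm_neon.h>
#include <limits.h>
#include <stdint.h>

// Sum of squared differences of one 16-byte block, kept in a NEON register.
static inline uint32x4_t block_ssd(const uint8_t *v1, const uint8_t *v2)
{
    uint8x16_t a = vld1q_u8(v1);
    uint8x16_t b = vld1q_u8(v2);
    int16x8_t d0 = vreinterpretq_s16_u16(vsubl_u8(vget_low_u8(a), vget_low_u8(b)));
    int16x8_t d1 = vreinterpretq_s16_u16(vsubl_u8(vget_high_u8(a), vget_high_u8(b)));
    uint32x4_t s = vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(d0, d0)));
    return vaddq_u32(s, vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(d1, d1))));
}

// Horizontal sum; this is where the NEON->ARM move happens.
static inline uint32_t hsum_u32(uint32x4_t v)
{
    uint32x2_t t = vadd_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpadd_u32(t, t), 0);
}

int findMinErr_NEON_pipelined(const uint8_t *var1, const uint8_t *var2, int size)
{
    int ret = 0, ret_err = INT_MAX;
    int blocks = size / 16;
    if (blocks == 0) return 0;

    uint32x4_t pending = block_ssd(var1, var2);      // block 0 in flight
    for (int i = 1; i < blocks; ++i)
    {
        uint32x4_t next = block_ssd(var1 + 16 * i, var2 + 16 * i);  // issue block i
        int err = (int)hsum_u32(pending);            // read back block i-1
        if (ret_err > err) { ret_err = err; ret = i - 1; }
        pending = next;
    }
    int err = (int)hsum_u32(pending);                // the last block
    if (ret_err > err) { ret_err = err; ret = blocks - 1; }
    return ret;
}

The same idea extends to deferring by two or more blocks by keeping more pending vectors.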
Update:
Here's a sample stall when moving data from a NEON register to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop iteration. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of nops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 as the nop: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?

An optimized implementation of the Heaviside function

I would like to (super)optimize an implementation of the Heaviside function.
I'm working on a numerical algorithm (in Fortran) where speed is particularly important. This employs the Heaviside function many times, currently implemented by the signum intrinsic function as follows:
heaviside = 0.5*sign(1,x)+1
I'm mainly interested in the case where x is a double precision real number on intel processors.
Is it possible to develop a more efficient implementation of the Heaviside function?
Perhaps using assembly language, a superoptimizer, or a call to an existing external library?
Did you intend heaviside = 0.5*(sign(1,x)+1)? In any case testing with gcc 4.8.1 fortran shows High Performance Mark's idea should be beneficial. Here are 3 possibilities:
heaviside1 - original
heaviside2 - High Performance Mark's idea
heaviside3 - another variation
function heaviside1 (x)
  double precision heaviside1, x
  heaviside1 = 0.5 * (sign(1d0,x) + 1)
end

function heaviside2 (x)
  double precision heaviside2, x
  heaviside2 = sign(0.5d0,x) + 0.5
end

function heaviside3 (x)
  double precision heaviside3, x
  heaviside3 = 0
  if (x .ge. 0) heaviside3 = 1
end

program demo
  double precision heaviside1, heaviside2, heaviside3, x, a, b, c
  do
    x = 0.5 - RAND(0)
    a = heaviside1(x)
    b = heaviside2(x)
    c = heaviside3(x)
    print *, "x=", x, "heaviside(x)=", a, b, c
  enddo
end
When compiled, gcc generates these 3 stand-alone functions:
<heaviside1_>:
vmovsd xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm0,XMMWORD PTR [rip+0x2d824]
vorpd xmm0,xmm0,XMMWORD PTR [rip+0x2d80c]
vaddsd xmm0,xmm0,QWORD PTR [rip+0x2d7f4]
vmulsd xmm0,xmm0,QWORD PTR [rip+0x2d81c]
ret
<heaviside2_>:
vmovsd xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm0,XMMWORD PTR [rip+0x2d844]
vorpd xmm0,xmm0,XMMWORD PTR [rip+0x2d85c]
vaddsd xmm0,xmm0,QWORD PTR [rip+0x2d844]
ret
<heaviside3_>:
vxorpd xmm0,xmm0,xmm0
vmovsd xmm1,QWORD PTR [rip+0x2d844]
vcmplesd xmm0,xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm1,xmm0
ret
When compiled with gcc, heaviside1 generates a multiply that might slow execution.
heaviside2 eliminates the multiply.
heaviside3 has the same number of instructions as heaviside2, but uses 2 fewer memory accesses.
For the stand-alone functions:
              instruction count   memory reference count
heaviside1          6                      5
heaviside2          5                      4
heaviside3          5                      2
The inline code for these functions avoids the need for the return instruction and ideally passes the arguments in registers and preloads other registers with needed constants. The exact result depends on the compiler used and the calling code. An estimate for inlined code:
              instruction count   memory reference count
heaviside1          4                      0
heaviside2          3                      0
heaviside3          2                      0
It looks like the function could be handled by as few as two compiler generated instructions: vcmplesd+vandpd. The first instruction creates a mask of all zeros if the argument is negative, or a mask of all ones otherwise. The second instruction applies the mask to a register constant value of one in order to produce the result value of zero or one.
Though I have not benchmarked these functions, it looks like the heaviside function should not take much execution time.
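As a rough illustration of that two-instruction sequence in C intrinsics (a sketch only, following the heaviside3 convention that x >= 0 maps to 1; this is not the code gcc emits, and a NaN argument yields 0 here):

#include <emmintrin.h>  // SSE2
#include <cstdio>

static inline double heaviside_sse2(double x)
{
    // all-ones mask in the low lane if 0.0 <= x, else all zeros
    __m128d mask = _mm_cmple_sd(_mm_setzero_pd(), _mm_set_sd(x));
    // apply the mask to the constant 1.0 and extract the low lane
    return _mm_cvtsd_f64(_mm_and_pd(_mm_set_sd(1.0), mask));
}

int main()
{
    printf("%g %g %g\n", heaviside_sse2(-2.5), heaviside_sse2(0.0), heaviside_sse2(3.0));  // 0 1 1
}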
---09/23/2013: adding x86_64 assembly language versions and C language benchmark---
file functions.s
//----------------------------------------------------------------------------
.intel_syntax noprefix
.text
//-----------------------------------------------------------------------------
// this heaviside function generates its own register constants
// double heaviside_a1 (double arg);
.globl heaviside_a1
heaviside_a1:
mov rax,0x3ff0000000000000
xorpd xmm1,xmm1 # xmm1: constant 0.0
cmplesd xmm1,xmm0 # xmm1: mask (all Fs or all 0s)
movq xmm0,rax # xmm0: constant 1.0
andpd xmm0,xmm1
retq
//-----------------------------------------------------------------------------
// this heaviside function uses register constants passed from caller
// double heaviside_a2 (double arg, double const0, double const1);
.globl heaviside_a2
heaviside_a2:
cmplesd xmm1,xmm0 # xmm1: mask (all Fs or all 0s)
movsd xmm0,xmm2 # xmm0: constant 1.0
andpd xmm0,xmm1
retq
//-----------------------------------------------------------------------------
file ctest.c
#define __USE_MINGW_ANSI_STDIO 1
#include <windows.h>
#include <stdio.h>
#include <stdint.h>
// functions.s
double heaviside_a1 (double x);
double heaviside_a2 (double arg, double const0, double const1);
//-----------------------------------------------------------------------------
static double heaviside_c1 (double x)
{
    double result = 0;
    if (x >= 0) result = 1;
    return result;
}
//-----------------------------------------------------------------------------
//
// queryPerformanceCounter - similar to QueryPerformanceCounter, but returns
// count directly.
uint64_t queryPerformanceCounter (void)
{
    LARGE_INTEGER int64;
    QueryPerformanceCounter (&int64);
    return int64.QuadPart;
}
//-----------------------------------------------------------------------------
//
// queryPerformanceFrequency - same as QueryPerformanceFrequency, but returns count directly.
uint64_t queryPerformanceFrequency (void)
{
    LARGE_INTEGER int64;
    QueryPerformanceFrequency (&int64);
    return int64.QuadPart;
}
//----------------------------------------------------------------------------
//
// lfsr64gpr - left shift galois type lfsr for 64-bit data, general purpose register implementation
//
static uint64_t lfsr64gpr (uint64_t data, uint64_t mask)
{
    uint64_t carryOut = data >> 63;
    uint64_t maskOrZ = -carryOut;
    return (data << 1) ^ (maskOrZ & mask);
}
//---------------------------------------------------------------------------
int runtests (uint64_t pattern, uint64_t mask)
{
    uint64_t startCount, elapsed, index, loops = 800000000;
    double ns;
    double total = 0;
    startCount = queryPerformanceCounter ();
    for (index = 0; index < loops; index++)
    {
        double x, result;
        pattern = lfsr64gpr (pattern, mask);
        x = (double) (int64_t) pattern;
        result = heaviside_c1 (x);
        total += result;
    }
    elapsed = queryPerformanceCounter () - startCount;
    ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
    printf ("heaviside_c1: %7.2f ns\n", ns);
    startCount = queryPerformanceCounter ();
    for (index = 0; index < loops; index++)
    {
        double x, result;
        pattern = lfsr64gpr (pattern, mask);
        x = (double) (int64_t) pattern;
        result = heaviside_a1 (x);
        //printf ("heaviside_a1 (%lf): %lf\n", x, result);
        total += result;
    }
    elapsed = queryPerformanceCounter () - startCount;
    ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
    printf ("heaviside_a1: %7.2f ns\n", ns);
    startCount = queryPerformanceCounter ();
    for (index = 0; index < loops; index++)
    {
        double x, result;
        const double const0 = 0.0;
        const double const1 = 1.0;
        pattern = lfsr64gpr (pattern, mask);
        x = (double) (int64_t) pattern;
        result = heaviside_a2 (x, const0, const1);
        //printf ("heaviside_a2 (%lf): %lf\n", x, result);
        total += result;
    }
    elapsed = queryPerformanceCounter () - startCount;
    ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
    printf ("heaviside_a2: %7.2f ns\n", ns);
    return total;
}
//---------------------------------------------------------------------------
int main (void)
{
    uint64_t mask;
    mask = 0xBEFFFFFFFFFFFFFF;
    // raise our priority to increase measurement accuracy
    SetPriorityClass (GetCurrentProcess (), REALTIME_PRIORITY_CLASS);
    printf ("using pseudo-random data\n");
    runtests (1, mask);
    return 0;
}
//---------------------------------------------------------------------------
mingw64 build command: gcc -Wall -Wextra -O3 -octest.exe ctest.c functions.s
Program output from Intel Core i7-2600K at 4.0 GHz:
using pseudo-random data
heaviside_c1: 2.24 ns
heaviside_a1: 2.00 ns
heaviside_a2: 2.00 ns
These timing results include execution of pseudo-random argument generation and result totalization code needed to keep the optimizer from eliminating the otherwise unused heaviside_c1 local function.
heaviside_c1 is from the original fortran suggestion, ported to C.
heaviside_a1 is an assembly language implementation.
heaviside_a2 is a modification of the assembly language version that uses register constants passed by the caller to avoid the overhead of generating them. For my processor, benchmarking shows no advantage to passing constants.
The assembly language functions assume xmm0 returns the result and xmm1 and xmm2 are available as scratch registers. This is valid for the x64 calling convention used by Windows. This assumption should be confirmed for other calling conventions.
In order to avoid memory accesses, the assembly language version expects the argument to be passed by value in a register (XMM0). Because this is not the fortran default, a special declaration is required. This one seems to work properly for gfortran 64-bit:
interface
  real(c_double) function heaviside_a1(x)
    use iso_c_binding, only: c_double
    real(c_double), VALUE :: x
  end function heaviside_a1
end interface

Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics.
I'm currently using 3 OR operations, and 2 MOVs:
uint32x4_t vr = vorrq_u32(vcmp0, vcmp1);
uint64x2_t v0 = vreinterpretq_u64_u32(vr);
uint64x1_t v0or = vorr_u64(vget_high_u64(v0), vget_low_u64(v0));
uint32x2_t v1 = vreinterpret_u32_u64 (v0or);
uint32_t r = vget_lane_u32(v1, 0) | vget_lane_u32(v1, 1);
if (r == 0) { // do stuff }
This translates by gcc to the following assembly code:
VORR q9, q9, q10
VORR d16, d18, d19
VMOV.32 r3, d16[0]
VMOV.32 r2, d16[1]
VORRS r2, r2, r3
BEQ ...
Does anyone have an idea of a faster way?
While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:
inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
The return value will be nonzero if any bit in the 128-bit NEON register was set.
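For example, a tiny self-contained check of that helper (my own test scaffolding, assuming an ARM target with NEON; the helper body is repeated from above so the snippet compiles on its own):

#include <arm_neon.h>
#include <cstdio>

static inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}

int main()
{
    uint32x4_t all_zero = vdupq_n_u32(0);
    uint32x4_t one_bit  = vsetq_lane_u32(1u << 7, all_zero, 2);  // set a single bit in lane 2
    printf("%u %u\n", is_not_zero(all_zero), is_not_zero(one_bit));  // prints 0, then a nonzero value
}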
If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:
inline uint64_t is_not_zero(uint32x4_t v)
{
    uint64x2_t v64 = vreinterpretq_u64_u32(v);
    uint32x2_t v32 = vqmovn_u64(v64);
    uint64x1_t result = vreinterpret_u64_u32(v32);
    return result[0];
}
You seem to be looking for intrinsics and this is the way:
inline bool is_zero(int32x4_t v) noexcept
{
    v = v != int32x4_t{};
    return !int32x2_t(
        vtbl2_s8(
            int8x8x2_t{
                int8x8_t(vget_low_s32(v)),
                int8x8_t(vget_high_s32(v))
            },
            int8x8_t{0, 4, 8, 12}
        )
    )[0];
}
Nils Pipenbrinck's answer has a flaw in that it assumes the QC (cumulative saturation) flag is clear.
If you have AArch64 you can do it even more easily: there is a new instruction designed for this.
inline uint32_t is_not_zero(uint32x4_t v)
{
    return vaddvq_u32(v);
}
I'd avoid functions returning integer values that should only be interpreted as bool. A better way would be, for instance, to define a helper function that returns the maximum unsigned value of the 4 lanes:
inline uint32_t max_lane_value_u32(const uint32x4_t& v)
{
#if defined(_WIN32) && defined(_ARM64_)
    // Windows 64-bit
    return neon_umaxvq32(v);
#elif defined(__LP64__)
    // Linux/Android 64-bit
    return vmaxvq_u32(v);
#else
    // Windows/Linux/Android 32-bit
    uint32x2_t result = vmax_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(result, result), 0);
#endif
}
you can then use:
if (0 == max_lane_value_u32(v))
{
...
}
in your code, and such a function might also be useful elsewhere. Alternatively, you can use the exact same code to write an is_not_zero() function, but then it's best form to return a bool.
Note that the only reason you'd need to define a helper function is because vmaxvq_u32() is not available on 32-bit, and may not be aliased from neon_umaxvq32() in arm64_neon.h on Windows.

Convert (vectorize) code with per-32-bit-element conditional to SSE2/SSE3

I want to vectorize code for Core 2. I think I can use intrinsic functions from gcc or icc, and the SSE, SSE2, SSE3, and SSSE3 instruction sets are allowed.
My code works on arrays of 8 uint32_t elements and looks like this (only the hotspot is shown):
const uint32_t p[8] = {2147483743, 2147483713, 2147483693, 2147483659,
                       2147483647, 2147483629, 2147483587, 2147483579};

void vector_mod_add(uint32_t *a /* a[8] */, uint32_t *b /* b[8] */) {
    int n;
    for (n = 0; n < 8; n++)
        a[n] += b[n];
    for (n = 0; n < 8; n++)
        if (a[n] >= p[n])
            a[n] -= p[n];
}
The addition is rather easy, but I don't know how to do the conditional subtraction.
Also, I have no experience with manual vectorization using SSE2, so please tell me how I should define all the types here.
You can write it as a[n] -= p[n] & ~(a[n] < p[n]). Note that the < here is not the C one; it's the SSE comparison (_mm_cmplt_epi32, which compiles to pcmpgtd with the operands swapped) that returns -1 in each true element and 0 in each false element (to allow the AND operation), and &~ is pandn. Here is an attempt at the code:
__m128i a, p;
a = _mm_sub_epi32(a, _mm_andnot_si128(_mm_cmplt_epi32(a, p), p));
Note that this uses signed operations, and so your numbers will need to stay below 2^31 - 1 for it to work correctly. If you need to go beyond that, change _mm_cmplt_epi32(a, p) to _mm_cmplt_epi32(_mm_xor_si128(a, signs), _mm_xor_si128(p, signs)), where signs is a vector of 32-bit words whose elements are all 0x80000000. Here is a version that seems like it will handle wider ranges more efficiently:
__m128i a, p;
a = _mm_sub_epi32(a, p);
a = _mm_add_epi32(a, _mm_and_si128(_mm_srai_epi32(a, 31), p));
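Putting the pieces together for the original function, here is a minimal sketch (my own untested assembly of the ideas above, with a hypothetical name vector_mod_add_sse2). It mirrors the scalar semantics exactly, uses the sign-flip trick so the comparison is effectively unsigned (needed because several p[n] exceed 2^31 - 1), and uses unaligned loads/stores to avoid alignment assumptions:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

static const uint32_t p[8] = {2147483743u, 2147483713u, 2147483693u, 2147483659u,
                              2147483647u, 2147483629u, 2147483587u, 2147483579u};

void vector_mod_add_sse2(uint32_t *a, const uint32_t *b)
{
    const __m128i signs = _mm_set1_epi32((int)0x80000000u);
    for (int n = 0; n < 8; n += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + n));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + n));
        __m128i vp = _mm_loadu_si128((const __m128i *)(p + n));
        va = _mm_add_epi32(va, vb);
        // unsigned "va < vp" via the sign-bit flip and a signed compare
        __m128i lt = _mm_cmplt_epi32(_mm_xor_si128(va, signs),
                                     _mm_xor_si128(vp, signs));
        // subtract vp only in the lanes where va >= vp
        va = _mm_sub_epi32(va, _mm_andnot_si128(lt, vp));
        _mm_storeu_si128((__m128i *)(a + n), va);
    }
}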