How to vectorize a[i] = a[i-1] + c with AVX2?

I want to vectorize a[i] = a[i-1] + c with AVX2 instructions. It seems unvectorizable because of the loop-carried dependency, but I have vectorized it and want to share my answer here to see whether there is a better answer to this question or whether my solution is good.

I have implemented the following function to vectorize this, and it seems OK: the speedup is 2.5x over gcc -O3.
Here is the solution:
// vectorized
inline void vec(int a[LEN], int b, int c)
{
    // b = 1 and c = 2 in this case
    int i = 0;
    a[i++] = b;              // 0 --> a[0] = 1
    // step 1:
    // solving the dependency; the vectorization factor is 8
    a[i++] = a[0] + 1*c;     // 1 --> a[1] = 1 + 2  = 3
    a[i++] = a[0] + 2*c;     // 2 --> a[2] = 1 + 4  = 5
    a[i++] = a[0] + 3*c;     // 3 --> a[3] = 1 + 6  = 7
    a[i++] = a[0] + 4*c;     // 4 --> a[4] = 1 + 8  = 9
    a[i++] = a[0] + 5*c;     // 5 --> a[5] = 1 + 10 = 11
    a[i++] = a[0] + 6*c;     // 6 --> a[6] = 1 + 12 = 13
    a[i++] = a[0] + 7*c;     // 7 --> a[7] = 1 + 14 = 15
    // vectorization factor reached:
    // from here on, adding 8*c to a[i-8] gives a[i] for every element
    // load the previous results into a vector
    __m256i dep1, dep2;      // dep = { 1, 3, 5, 7, 9, 11, 13, 15 }
    __m256i coeff = _mm256_set1_epi32(8*c);   // coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }
    // assumes a is 32-byte aligned and LEN - 8 is a multiple of 16;
    // otherwise use unaligned loads/stores and handle the remainder with a scalar tail loop
    for (; i + 16 <= LEN; i += 16) {
        dep1 = _mm256_load_si256((__m256i *) &a[i-8]);
        dep1 = _mm256_add_epi32(dep1, coeff);
        _mm256_store_si256((__m256i *) &a[i], dep1);
        dep2 = _mm256_load_si256((__m256i *) &a[i]);
        dep2 = _mm256_add_epi32(dep2, coeff);
        _mm256_store_si256((__m256i *) &a[i+8], dep2);
    }
}
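Since the recurrence has the closed form a[i] = b + i*c, another way to break the dependency is to build each vector of eight values directly and step it by 8*c, instead of reloading the block that was just stored. This is only a sketch for comparison, not benchmarked; it assumes LEN (from the code above) is a multiple of 8 and that a is 32-byte aligned:

#include <immintrin.h>

// closed-form variant: a[i] = b + i*c, written 8 lanes at a time
inline void vec_closed_form(int a[LEN], int b, int c)
{
    // lane indices {0,1,...,7} scaled by c, plus the start value b
    __m256i idx  = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i vc   = _mm256_set1_epi32(c);
    __m256i cur  = _mm256_add_epi32(_mm256_set1_epi32(b),
                                    _mm256_mullo_epi32(idx, vc));   // { b, b+c, ..., b+7c }
    __m256i step = _mm256_set1_epi32(8 * c);                        // advance by 8 elements
    for (int i = 0; i + 8 <= LEN; i += 8) {
        _mm256_store_si256((__m256i *) &a[i], cur);   // store a[i..i+7]
        cur = _mm256_add_epi32(cur, step);            // values for the next 8 elements
    }
}

The loop still carries one vector add per iteration, but it avoids the extra load and works from i = 0 without a scalar prologue.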

Related

Given an array A[], determine the minimum value of expression: min( abs( a[ i ] - x ) , abs( a[ i ] - y ) )

Given an array A of N integers, find two integers x and y such that the sum of the absolute differences between each array element and the nearer of the two chosen integers is minimal.
That is, determine the minimum value of the expression
Σ(i=0 to N-1) min( abs(A[i] - x), abs(A[i] - y) )
Example 1:
N = 4
A = [2,3,6,7]
Approach
You can choose the two integers, 3 and 7
The required sum = |2-3| + |3-3| + |6-7| + |7-7| = 1+0+0+1 = 2
Example 2:
N = 8
A = [ 2, 3, 5, 8, 11, 14, 17, 996 ]
Approach
You can choose the two integers, 8 and 996
The required sum = |2-8| + |3-8| + |5-8| + |8-8| + |11-8| + |14-8| + |17-8| + |996-996| = 6+5+3+0+3+6+9+0 = 32
Constraints
1<=T<=100
2<=N<=5*10^3
1<=A[i]<=10^5
The sum of N over all test cases does not exceed 5*10^3.
Please help me with an optimal solution, O(N) or O(N log N).
My code:
int minAbsDiff(vector<int> Arr, int N)
{
    // Approach: O(N^3)
    sort(Arr.begin(), Arr.end());
    int sum = INT_MAX;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
        {
            int tmp_sum = 0;
            for (int k = 0; k < N; k++)
            {
                tmp_sum += min(abs(Arr[k] - Arr[i]), abs(Arr[k] - Arr[j]));
            }
            sum = min(sum, tmp_sum);
        }
    std::cout << "Sum is :" << sum << std::endl;
    return sum;
}
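For reference, here is one possible O(N log N) approach (a sketch, not from the original post): after sorting, every element is served by whichever of x and y is nearer, so an optimal solution splits the sorted array into a prefix served by x and a suffix served by y; for a fixed split the best x and y are the medians of the two parts, and each part's cost can be computed in O(1) with prefix sums.

#include <algorithm>
#include <climits>
#include <vector>
using namespace std;

// Cost of serving the sorted slice a[l..r] with its median, via prefix sums P.
static long long sliceCost(const vector<long long>& P, const vector<int>& a, int l, int r)
{
    int mid = (l + r) / 2;
    long long m = a[mid];
    long long left  = m * (mid - l + 1) - (P[mid + 1] - P[l]);   // sum of (m - a[i]) for i in [l, mid]
    long long right = (P[r + 1] - P[mid + 1]) - m * (r - mid);   // sum of (a[i] - m) for i in (mid, r]
    return left + right;
}

long long minAbsDiffFast(vector<int> a)
{
    int n = (int)a.size();
    sort(a.begin(), a.end());
    vector<long long> P(n + 1, 0);                 // P[i] = a[0] + ... + a[i-1]
    for (int i = 0; i < n; i++)
        P[i + 1] = P[i] + a[i];
    long long best = LLONG_MAX;
    // Prefix [0, s-1] is served by x, suffix [s, n-1] by y.
    for (int s = 1; s < n; s++)
        best = min(best, sliceCost(P, a, 0, s - 1) + sliceCost(P, a, s, n - 1));
    return best;
}

For the examples above, minAbsDiffFast({2, 3, 6, 7}) returns 2 and minAbsDiffFast({2, 3, 5, 8, 11, 14, 17, 996}) returns 32.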

Algorithm for computing inverse polynomials in NTRUEncrypt

I'm implementing an algorithm for computing inverse polynomials in the NTRU cryptosystem, following the paper "Almost Inverses and Fast NTRU Key Creation" by Joseph H. Silverman. I implemented the second pseudo-code as:
int inverse_mod_p(polynomial *r, polynomial *a)
{
    int k;
    int16_t b[NTRU_N + 1], c[NTRU_N + 1], f[NTRU_N + 1], g[NTRU_N + 1];
    int i;
    int16_t aux;
    int zero_f;
    int constant_f;
    int deg_fg;

    memset(b, 0, (NTRU_N + 1) * sizeof(int16_t));
    b[0] = 1;
    memset(c, 0, (NTRU_N + 1) * sizeof(int16_t));
    memcpy(f, a->coeffs, NTRU_N * sizeof(int16_t));
    f[NTRU_N] = 0;
    memset(g, 0, (NTRU_N + 1) * sizeof(int16_t));
    g[0] = -1;
    g[NTRU_N] = 1;

    while (1)
    {
        zero_f = 1;
        for (i = 0; i < NTRU_N + 1; i++)
        {
            if (f[i] != 0)
            {
                zero_f = 0;
                break;
            }
        }
        if (zero_f)
            return 1;
        while (f[0] == 0)
        {
            for (i = 0; i < NTRU_N; i++)
            {
                f[i] = f[i + 1];
                c[NTRU_N - i] = c[NTRU_N - i - 1];
            }
            f[NTRU_N] = 0;
            c[0] = 0;
            k++;
        }
        constant_f = 1;
        for (i = 1; i < NTRU_N + 1; i++)
        {
            if (f[i] != 0)
            {
                constant_f = 0;
                break;
            }
        }
        if (constant_f)
            break;
        deg_fg = 0;
        for (i = NTRU_N; i >= 0; i--)
        {
            if (f[i] == 0 && g[i] != 0)
            {
                deg_fg = 1;
                break;
            }
            else if (f[i] != 0 && g[i] == 0)
            {
                break;
            }
        }
        if (deg_fg)
        {
            for (i = 0; i < NTRU_N + 1; i++)
            {
                aux = f[i];
                f[i] = g[i];
                g[i] = aux;
                aux = b[i];
                b[i] = c[i];
                c[i] = aux;
            }
        }
        if (f[0] == g[0])
        {
            for (i = 0; i < NTRU_N + 1; i++)
            {
                f[i] = (f[i] - g[i] + 3) % 3;
                b[i] = (b[i] - c[i] + 3) % 3;
            }
        }
        else
        {
            for (i = 0; i < NTRU_N + 1; i++)
            {
                f[i] = (f[i] + g[i] + 3) % 3;
                b[i] = (b[i] + c[i] + 3) % 3;
            }
        }
    }

    k = k % NTRU_N;
    for (i = NTRU_N - 1; i >= 0; i--)
    {
        if (i - k < 0)
            r->coeffs[i - k + NTRU_N] = b[i] * f[0];
        else
            r->coeffs[i - k] = b[i] * f[0];
    }
    for (i = 0; i < NTRU_N; i++)
        r->coeffs[i] = (r->coeffs[i] + 3) % 3;
    return 0;
}
But this seems to be wrong. I tested it using the example given on Wikipedia: https://en.wikipedia.org/wiki/NTRUEncrypt. The polynomial -1 + x + x^2 - x^4 + x^6 + x^9 - x^10 should have as its inverse the polynomial 1 + 2x + 2x^3 + 2x^4 + x^5 + 2x^7 + x^8 - x^10, but I got the following result:
Polinomial:
-1 1 1 0 -1 0 1 0 0 1 -1
Inverse polinomial:
0 2 2 1 0 2 1 2 0 1 2
Where is the error in the implementation?
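One way to narrow this down (a debugging sketch, not part of the original code; is_inverse_mod3 is just an illustrative name, and it assumes NTRU_N = 11 and coefficient arrays laid out as printed above) is to multiply a candidate inverse by the input in (Z/3Z)[X]/(X^N - 1) and check whether the product is the constant polynomial 1:

#include <stdint.h>
#include <stdio.h>

#define NTRU_N 11   /* degree parameter of the Wikipedia example */

/* Returns 1 if r * a == 1 in (Z/3Z)[X]/(X^NTRU_N - 1), else 0. */
int is_inverse_mod3(const int16_t a[NTRU_N], const int16_t r[NTRU_N])
{
    int16_t prod[NTRU_N] = {0};
    for (int i = 0; i < NTRU_N; i++)
        for (int j = 0; j < NTRU_N; j++)
            prod[(i + j) % NTRU_N] =
                (int16_t)((prod[(i + j) % NTRU_N] + a[i] * r[j]) % 3);
    for (int i = 0; i < NTRU_N; i++) {
        int v = (prod[i] + 3) % 3;           /* normalise to {0, 1, 2} */
        if (v != (i == 0 ? 1 : 0))
            return 0;
    }
    return 1;
}

int main(void)
{
    /* -1 + x + x^2 - x^4 + x^6 + x^9 - x^10 and the result printed above */
    int16_t a[NTRU_N] = { -1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1 };
    int16_t r[NTRU_N] = {  0, 2, 2, 1,  0, 2, 1, 2, 0, 1,  2 };
    printf("is inverse: %d\n", is_inverse_mod3(a, r));
    return 0;
}

Running this against both the expected inverse and the computed one shows directly whether the mismatch is a real error or only a difference in coefficient normalisation or rotation.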

CGRect with variable Buttons per Row

The situation is as follows:
I want to build the button frames (CGRects) for a grid with up to 4 rows and up to 6 columns.
For example, this is how it has to look. The minimum I need is:
+ + + +
(1 row, 4 buttons)
and the maximum I need is:
+ + + + + +
+ + + + +
+ + + + + +
+ + + + +
(4 rows, 22 buttons)
What I want is to pass the BUTTONS_FOR_ROW1-4 data from another view controller. For the minimum example this is button_for_row1 = 4, button_for_row2 = 0, button_for_row3 = 0, button_for_row4 = 0.
For the maximum example it is button_for_row1 = 6, button_for_row2 = 5, button_for_row3 = 6, button_for_row4 = 5.
My code now is this:
-(void) generateCardViews {
    int positionsLeftInRow = _BUTTONS_PER_ROW;
    int j = 0; // j = ROWNUMBER (j = 0) = ROW1, (j = 1) = ROW2...
    for (int i = 0; i < [self.gameModel.buttons count]; i++) {
        NSInteger value = ((ButtonModel *)self.gameModel.buttons[i]).value;
        CGFloat x = (i % _BUTTONS_PER_ROW) * 121 + (i % _BUTTONS_PER_ROW) * 40 + 285;
        if (j == 1) {
            x += 80; // set additional indent (horizontal displacement)
        }
        if (j == 2) {
            x -= 160;
        }
        CGFloat y = j * 122 + j * 40 + 158;
        CGRect frame = CGRectMake(x, y, 125, 125);
        ButtonView *cv = [[ButtonView alloc] initWithFrame:frame andPosition:i andValue:value];
        if (!((ButtonModel *)self.gameModel.buttons[i]).outOfPlay) {
            [self.boardView addSubview:cv];
            if ([self.gameModel.turnedButtons containsObject: self.gameModel.buttons[i]]) {
                [self.turnedButtonViews addObject: cv];
                [cv flip];
            }
        }
        if (--positionsLeftInRow == 0) {
            j++;
            positionsLeftInRow = _BUTTONS_PER_ROW;
            if (j == 1) {
                positionsLeftInRow = _BUTTONS_PER_ROW - 1;
                if (j == 2) {
                    positionsLeftInRow = _BUTTONS_PER_ROW - 2;
                }
            }
        }
    }
}
As you can see, my code currently has only BUTTONS_PER_ROW and not BUTTONS_FOR_ROW1-4.
The indent is for pushing a row to the left or right.
But I think this would work much more easily, because with my current code I get some weird results when I create 22 buttons.
Thanks for your help!
Your basic logic can be simplified. I would use a while loop here. For the case of a maximum number of buttons per row (the simpler case), this is an outline of what I think is a clear way to do what you want:
y = .... (initial value of y)
NSUInteger numberOfButtons = [self.gameModel.buttons count];
// Lay out the buttons, with no more than _BUTTONS_PER_ROW in each row
NSUInteger buttonsLeft = numberOfButtons;
NSUInteger buttonsInRow = 0;
while (buttonsLeft > 0)
{
    if (buttonsInRow >= _BUTTONS_PER_ROW)
    {
        // Row is full: increment y, reset x
        y += ....
        x = .... (initial position for x)
        buttonsInRow = 0;
    }
    // Create a button
    CGRect frame = CGRectMake(x, y, 125, 125);
    ButtonView* buttonView = [[ButtonView .....
    x += 80;
    ++buttonsInRow;
    --buttonsLeft;
}
For the more general case, add a variable to keep track of the row number, and use an array with the maximum number of buttons per row, loaded before entering the while loop.
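For what it's worth, the per-row arithmetic described above could look roughly like this framework-agnostic C++ sketch (the sizes, spacing and the centring rule are placeholder assumptions, not values from the original code); the resulting frames would then be turned into CGRects and ButtonViews as in the question:

#include <array>
#include <vector>

struct Frame { float x, y, w, h; };

// Per-row layout: each row gets its own button count and is centred horizontally.
std::vector<Frame> layoutButtons(const std::array<int, 4>& buttonsForRow,
                                 float boardWidth,
                                 float buttonSize = 125.0f,
                                 float spacing = 40.0f,
                                 float top = 158.0f)
{
    std::vector<Frame> frames;
    float y = top;
    for (int row = 0; row < 4; ++row) {
        int count = buttonsForRow[row];
        if (count == 0) continue;                       // skip empty rows
        float rowWidth = count * buttonSize + (count - 1) * spacing;
        float x = (boardWidth - rowWidth) / 2.0f;       // centre this row
        for (int col = 0; col < count; ++col) {
            frames.push_back({x, y, buttonSize, buttonSize});
            x += buttonSize + spacing;
        }
        y += buttonSize + spacing;                      // move down to the next row
    }
    return frames;
}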

How to convert a modelview matrix to gluLookAt parameters?

I have a requirement in Bullet Physics with OpenGL where I have the modelview matrix but need to reproduce the same matrix by calling gluLookAt. Thanks in advance.
From any 4x4 view matrix we can get the gluLookAt parameters, which are CameraPos, CameraTarget and UpVector.
Here is the code to get CameraPos, CameraTarget and UpVector from the ModelView matrix.
float modelViewMat[16];
glGetFloatv(GL_MODELVIEW_MATRIX, modelViewMat);
// Here instead of model view matrix we can pass any 4x4 matrix.
float params[9];
GetGluLookAtParameters(modelViewMat, params);
CameraPos.x = params[0];
CameraPos.y = params[1];
CameraPos.z = params[2];
CameraTarget.x = params[3];
CameraTarget.y = params[4];
CameraTarget.z = params[5];
UpVector.x = params[6];
UpVector.y = params[7];
UpVector.z = params[8];
void GetGluLookAtParameters(float* m, float* gluLookAtParams)
{
    VECTOR3D sideVector(m[0], m[4], m[8]);
    VECTOR3D upVector(m[1], m[5], m[9]);
    VECTOR3D forwardVector(-m[2], -m[6], -m[10]);
    sideVector.Normalize();
    upVector.Normalize();
    forwardVector.Normalize();

    float rotMat[16];
    memcpy(rotMat, m, 16*sizeof(float));
    rotMat[12] = rotMat[13] = rotMat[14] = rotMat[3] = rotMat[7] = rotMat[11] = 0.0f;
    rotMat[15] = 1.0f;

    float rotInvert[16];
    __gluInvertMatrixd(rotMat, rotInvert);

    float transMat[16];
    memset(transMat, 0, 16*sizeof(float));
    transMat[0] = transMat[5] = transMat[10] = transMat[15] = 1.0f;
    MultMat(rotInvert, m, transMat);

    gluLookAtParams[0] = -transMat[12];
    gluLookAtParams[1] = -transMat[13];
    gluLookAtParams[2] = -transMat[14];
    gluLookAtParams[3] = -transMat[12] + forwardVector.x;
    gluLookAtParams[4] = -transMat[13] + forwardVector.y;
    gluLookAtParams[5] = -transMat[14] + forwardVector.z;
    gluLookAtParams[6] = upVector.x;
    gluLookAtParams[7] = upVector.y;
    gluLookAtParams[8] = upVector.z;
}

void MultMat(float* a, float* b, float* result)
{
    result[0]  = a[0]*b[0]  + a[4]*b[1]  + a[8]*b[2]   + a[12]*b[3];
    result[1]  = a[1]*b[0]  + a[5]*b[1]  + a[9]*b[2]   + a[13]*b[3];
    result[2]  = a[2]*b[0]  + a[6]*b[1]  + a[10]*b[2]  + a[14]*b[3];
    result[3]  = a[3]*b[0]  + a[7]*b[1]  + a[11]*b[2]  + a[15]*b[3];
    result[4]  = a[0]*b[4]  + a[4]*b[5]  + a[8]*b[6]   + a[12]*b[7];
    result[5]  = a[1]*b[4]  + a[5]*b[5]  + a[9]*b[6]   + a[13]*b[7];
    result[6]  = a[2]*b[4]  + a[6]*b[5]  + a[10]*b[6]  + a[14]*b[7];
    result[7]  = a[3]*b[4]  + a[7]*b[5]  + a[11]*b[6]  + a[15]*b[7];
    result[8]  = a[0]*b[8]  + a[4]*b[9]  + a[8]*b[10]  + a[12]*b[11];
    result[9]  = a[1]*b[8]  + a[5]*b[9]  + a[9]*b[10]  + a[13]*b[11];
    result[10] = a[2]*b[8]  + a[6]*b[9]  + a[10]*b[10] + a[14]*b[11];
    result[11] = a[3]*b[8]  + a[7]*b[9]  + a[11]*b[10] + a[15]*b[11];
    result[12] = a[0]*b[12] + a[4]*b[13] + a[8]*b[14]  + a[12]*b[15];
    result[13] = a[1]*b[12] + a[5]*b[13] + a[9]*b[14]  + a[13]*b[15];
    result[14] = a[2]*b[12] + a[6]*b[13] + a[10]*b[14] + a[14]*b[15];
    result[15] = a[3]*b[12] + a[7]*b[13] + a[11]*b[14] + a[15]*b[15];
}

int __gluInvertMatrixd(const float src[16], float inverse[16])
{
    int i, j, k, swap;
    float t;
    GLfloat temp[4][4];

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            temp[i][j] = src[i*4+j];

    for (int i = 0; i < 16; i++)
        inverse[i] = 0;
    inverse[0] = inverse[5] = inverse[10] = inverse[15] = 1.0f;

    for (i = 0; i < 4; i++)
    {
        swap = i;
        for (j = i + 1; j < 4; j++)
            if (fabs(temp[j][i]) > fabs(temp[i][i]))
                swap = j;
        if (swap != i) {
            // Swap rows.
            for (k = 0; k < 4; k++) {
                t = temp[i][k];
                temp[i][k] = temp[swap][k];
                temp[swap][k] = t;
                t = inverse[i*4+k];
                inverse[i*4+k] = inverse[swap*4+k];
                inverse[swap*4+k] = t;
            }
        }
        if (temp[i][i] == 0)
            return 0;
        t = temp[i][i];
        for (k = 0; k < 4; k++) {
            temp[i][k] /= t;
            inverse[i*4+k] /= t;
        }
        for (j = 0; j < 4; j++) {
            if (j != i) {
                t = temp[j][i];
                for (k = 0; k < 4; k++) {
                    temp[j][k] -= temp[i][k]*t;
                    inverse[j*4+k] -= inverse[i*4+k]*t;
                }
            }
        }
    }
    return 1;
}
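If the modelview matrix is known to be a pure rigid-body view matrix (rotation plus translation, no scaling), the parameters can also be read off directly without a general 4x4 inversion. A minimal sketch of that shortcut, assuming the column-major layout OpenGL uses (my own shorthand, not part of the answer above):

struct Vec3 { float x, y, z; };

// Assumes m is a column-major rigid-body view matrix (rotation + translation,
// no scaling), as returned by glGetFloatv(GL_MODELVIEW_MATRIX, m) for a camera.
void ViewMatrixToLookAt(const float m[16], Vec3& eye, Vec3& center, Vec3& up)
{
    // Rows of the upper-left 3x3 block are the camera axes in world space.
    Vec3 forward = { -m[2], -m[6], -m[10] };   // viewing direction
    up = { m[1], m[5], m[9] };                 // up vector

    // eye = -R^T * t, with t = (m[12], m[13], m[14]).
    eye.x = -(m[0]*m[12] + m[1]*m[13] + m[2]*m[14]);
    eye.y = -(m[4]*m[12] + m[5]*m[13] + m[6]*m[14]);
    eye.z = -(m[8]*m[12] + m[9]*m[13] + m[10]*m[14]);

    // Any point along the viewing direction can serve as the target.
    center = { eye.x + forward.x, eye.y + forward.y, eye.z + forward.z };
}

For matrices that include scaling or shearing, the general inversion approach above (or normalizing the axis vectors first) is still needed.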

Algorithm to find direction between two keys on the num pad?

Given the following direction enum:
typedef enum {
    DirectionNorth = 0,
    DirectionNorthEast,
    DirectionEast,
    DirectionSouthEast,
    DirectionSouth,
    DirectionSouthWest,
    DirectionWest,
    DirectionNorthWest
} Direction;
And number matrix similar to the numeric pad:
7 8 9
4 5 6
1 2 3
How would you write a function to return the direction between adjacent numbers from the matrix? Say:
1, 2 => DirectionEast
2, 1 => DirectionWest
4, 8 => DirectionNorthEast
1, 7 => undef
You may change the numeric values of the enum if you want to. Readable solutions preferred. (Not a homework, just an algorithm for an app I am working on. I have a working version, but I’m interested in more elegant takes.)
int direction_code(int a, int b)
{
    assert(a >= 1 && a <= 9 && b >= 1 && b <= 9);
    int ax = (a - 1) % 3, ay = (a - 1) / 3,
        bx = (b - 1) % 3, by = (b - 1) / 3,
        deltax = bx - ax, deltay = by - ay;
    if (abs(deltax) < 2 && abs(deltay) < 2)
        return 1 + (deltay + 1)*3 + (deltax + 1);
    return 5;
}
The resulting codes are:
1 south-west
2 south
3 south-east
4 west
5 invalid
6 east
7 north-west
8 north
9 north-east
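To translate these codes back into the Direction enum from the question, a small lookup table is enough (a sketch, not part of the answer above; the -1 entries mark my own "invalid" sentinel):

// Index = code returned by direction_code(), value = Direction (or -1 for invalid).
static const int code_to_direction[10] = {
    -1,                    // 0: unused
    DirectionSouthWest,    // 1
    DirectionSouth,        // 2
    DirectionSouthEast,    // 3
    DirectionWest,         // 4
    -1,                    // 5: invalid / not adjacent
    DirectionEast,         // 6
    DirectionNorthWest,    // 7
    DirectionNorth,        // 8
    DirectionNorthEast     // 9
};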
I would redefine the values in the enum so that North, South, East and West take a different bit each.
typedef enum {
    undef = 0,
    DirectionNorth = 1,
    DirectionEast = 2,
    DirectionSouth = 4,
    DirectionWest = 8,
    DirectionNorthEast = DirectionNorth | DirectionEast,
    DirectionSouthEast = DirectionSouth | DirectionEast,
    DirectionNorthWest = DirectionNorth | DirectionWest,
    DirectionSouthWest = DirectionSouth | DirectionWest
} Direction;
With those new values:
int ax = ( a - 1 ) % 3, ay = ( a - 1 ) / 3;
int bx = ( b - 1 ) % 3, by = ( b - 1 ) / 3;
int diffx = std::abs( ax - bx );
int diffy = std::abs( ay - by );

int result = undef;
if( diffx <= 1 && diffy <= 1 )
{
    result |= ( bx == ax - 1 ) ? DirectionWest : 0;
    result |= ( bx == ax + 1 ) ? DirectionEast : 0;
    result |= ( by == ay - 1 ) ? DirectionSouth : 0;
    result |= ( by == ay + 1 ) ? DirectionNorth : 0;
}
return static_cast< Direction >( result );
Update: Finally, I think it's correct now.
With this matrix of numbers, the following holds true for adjacent keys (a sketch of these rules in code follows below):
1) a difference of 1 (+ve or -ve) implies that the direction is either east or west.
2) similarly, a difference of 3 implies north or south.
3) a difference of 4 implies north-east or south-west.
4) a difference of 2 implies north-west or south-east.
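Here is a minimal sketch of those difference rules (not from the original answers), using the Direction enum from the question; adjacency is checked via row/column coordinates first, since for example 3 and 4 also differ by 1 but are not adjacent:

#include <cstdlib>

// Maps the numeric difference of two adjacent keypad keys to a direction,
// using the observations above. Returns -1 for non-adjacent pairs.
int direction_from_difference(int a, int b)
{
    int ax = (a - 1) % 3, ay = (a - 1) / 3;
    int bx = (b - 1) % 3, by = (b - 1) / 3;
    if (std::abs(bx - ax) > 1 || std::abs(by - ay) > 1 || a == b)
        return -1;                                   // not adjacent

    switch (b - a) {
        case  1: return DirectionEast;
        case -1: return DirectionWest;
        case  3: return DirectionNorth;
        case -3: return DirectionSouth;
        case  4: return DirectionNorthEast;
        case -4: return DirectionSouthWest;
        case  2: return DirectionNorthWest;
        case -2: return DirectionSouthEast;
        default: return -1;
    }
}

For the examples in the question, direction_from_difference(1, 2) gives DirectionEast, (2, 1) gives DirectionWest, (4, 8) gives DirectionNorthEast, and (1, 7) gives -1.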