Mathematical equivalent to double right shift and AND operator? - operators

I have to rewrite the following expression in my program code:
arr[i] = (arr[i] << 16) & 0x00FF0000;
Can I rewrite this with multiplication/division operators ?

For non-negative values it should be
arr[i] = (arr[i] % 256) * 65536;
but bitwise operations are faster.
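As a quick check of that equivalence, here is a minimal C sketch that puts both forms side by side. It assumes the array elements are unsigned 32-bit integers; the function names are made up for illustration. For signed negative values the two forms would not agree in C, because % can produce a negative remainder.

```c
#include <stdint.h>

/* Shift-and-mask form from the question. */
uint32_t shift_mask(uint32_t x)
{
    return (x << 16) & 0x00FF0000u;
}

/* Arithmetic form: keep the low byte, then scale it by 2^16. */
uint32_t mod_mul(uint32_t x)
{
    return (x % 256u) * 65536u;
}
```

For example, both functions map 0x12345678 to 0x00780000: the low byte 0x78 ends up in bits 16 to 23 and everything else is cleared.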

Related

Calc CRC8 for objective c

I need a method to compute a CRC-8 checksum.
I found this code, but it's not working:
- (int)crc8Checksum:(NSString*)dataFrame{
char j;
int crc8 = 0;
int x = 0;
for (int i = 0; i < [dataFrame length]; i++){
x = [dataFrame characterAtIndex:i];
for (int k = 0; k < 8; k++){
j = 1 & (x ^ crc8);
crc8 = floor0(crc8 / 2) & 0xFF;
x = floor0(x / 2) & 0xFF;
if (j != 0 ){
crc8 = crc8 ^ 0x8C;
}
}
}
return crc8;
}
Help me please!
What do you mean "it's not working"? There are 14 different CRC-8 definitions in this catalog, and probably many more out there in the wild. Do you have some CRC values you are comparing to? Is there documentation on what CRC you actually need? What are your test messages and corresponding expected CRCs?
You can't just pick some random CRC-8 code and expect it to do what you need.
That particular code computes a CRC-8/MAXIM in the linked catalog. However, it is truly awful code, with unnecessary divides and floors. Here is a better, simpler, faster inner loop:
crc8 ^= x;
for (int k = 0; k < 8; k++)
crc8 = crc8 & 1 ? (crc8 >> 1) ^ 0x8c : crc8 >> 1;
You can get it faster still with tables and algorithms that compute the CRC a byte at a time or a machine word at a time.
The x in the code has its own problems: an NSString can be a string of Unicode characters, so characterAtIndex may not return a byte, and length may not return the number of bytes. You need a way to get the message as a series of bytes.
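To make the byte-oriented version concrete, here is a minimal sketch in plain C that wraps the inner loop above in a function over a byte buffer. The function name is hypothetical, and how the bytes are obtained from the NSString (for instance by encoding it as UTF-8 first) is an assumption that depends on what the receiving device expects.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8 with reflected polynomial 0x8C, initial value 0 and
   no final XOR, i.e. the same computation as the inner loop above. */
uint8_t crc8_maxim(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                      /* fold in the next byte */
        for (int k = 0; k < 8; k++)          /* process it bit by bit */
            crc = (crc & 1) ? (uint8_t)((crc >> 1) ^ 0x8C)
                            : (uint8_t)(crc >> 1);
    }
    return crc;
}
```

With these parameters the single byte 0x01 gives 0x5E, and appending the CRC byte to a message makes the CRC of the whole frame come out as 0, which is a handy receiver-side check.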

Comparison between 2D and 3D Affine transforms

Is it expected that the following test should fail?
The test compares results of a 2D and a 3D AffineTransformation. Both are constructed to have unit scaling and zero offsets in the y and z direction, but to have non-zero and non-unity scaling and offset in the x direction. All other off-diagonal elements are zero. It is my belief that these transformations are identical in the x and y directions, and hence should produce identical results.
Furthermore I have found that the test passes if I use this Kernel:
using K = CGAL::Exact_predicates_exact_constructions_kernel;
Is it to be expected that the test passes if I use this Kernel? Should the test fail with either kernel or pass with either kernel?
TEST(TransformerTest, testCGALAffine) {
using K = CGAL::Exact_predicates_inexact_constructions_kernel;
using Float = typename K::FT;
using Transformation_2 = K::Aff_transformation_2;
using Transformation_3 = K::Aff_transformation_3;
using Point_2 = typename K::Point_2;
using Point_3 = typename K::Point_3;
double lowerCorner(17.005142946538115);
double upperCorner(91.940521484752139);
int resolution = 48;
double tmpScaleX((upperCorner - lowerCorner) / resolution);
Float scaleX(tmpScaleX);
Float zero(0);
Float unit(1);
// create a 2D voxel to world transform
Transformation_2 transformV2W_2(scaleX, zero, Float(lowerCorner),
zero, unit, zero,
unit);
// create its inverse: a 2D world to voxel transform
auto transformW2V_2 = transformV2W_2.inverse();
// create a 3D voxel to world transform
Transformation_3 transformV2W_3(scaleX, zero, zero, Float(lowerCorner),
zero, unit, zero, zero,
zero, zero, unit, zero,
unit);
// create its inverse: a 3D world to voxel transform
auto transformW2V_3 = transformV2W_3.inverse();
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 2; ++j) {
EXPECT_EQ(transformV2W_2.cartesian(i, j), transformV2W_3.cartesian(i, j)) << i << ", " << j;
EXPECT_EQ(transformW2V_2.cartesian(i, j), transformW2V_3.cartesian(i, j)) << i << ", " << j;
}
}
std::mt19937_64 rng(0);
std::uniform_real_distribution<double> randReal(0, resolution);
// compare the results of 2D and 3D transformations of random locations
for (int i = 0; i < static_cast<int>(1e4); ++i) {
Float x(randReal(rng));
Float y(randReal(rng));
auto world_2 = transformV2W_2(Point_2(x, y));
auto world_3 = transformV2W_3(Point_3(x, y, 0));
EXPECT_EQ(world_2.x(), world_3.x()) << world_2 << ", " << world_3;
auto voxel_2 = transformW2V_2(world_2);
auto voxel_3 = transformW2V_3(world_3);
EXPECT_EQ(voxel_2.x(), voxel_3.x()) << voxel_2 << ", " << voxel_3;
}
}

How does this message splitting work?

I have been trying to reverse engineer various encryption algorithms in compiled code recently, and I came upon this code. It is part of an RSA algorithm. I noticed that the key size is too small to encrypt/decrypt the data it is supposed to handle (in this case an int), so the code splits the message into two pieces, encrypts/decrypts each, then sums them together. I've pulled out the segments of code that split and join the message and experimented with them. It appears that the numerical values they use depend on the modulus n. So, what exactly is this scheme, and how does it work?
uint n = 32437;
uint origVal = 12345;
uint newVal = 0;
for (int i = 0; i < 2; ++i)
{
ulong num = (ulong)origVal * 43827549;
//uint num2 = ((origVal - (uint)(num >> 32)) / 2 + (uint)(num >> 32)) >> 14;
uint num2 = (origVal + (uint)(num >> 32)) / 32768;
origVal -= num2 * n;
// RSA encrypt/decrypt here
newVal *= n;
newVal += origVal;
origVal = num2;
}
// Put newVal into origVal, to reverse
origVal = newVal;
newVal = 0;
for (int i = 0; i < 2; ++i)
{
ulong num = (ulong)origVal * 43827549;
//uint num2 = ((origVal - (uint)(num >> 32)) / 2 + (uint)(num >> 32)) >> 14;
uint num2 = (origVal + (uint)(num >> 32)) / 32768;
origVal -= num2 * n;
// RSA encrypt/decrypt here
newVal *= n;
newVal += origVal;
origVal = num2;
}
Note: it seems the operations applied are symmetric.
After trying various values for origVal, I found that the first three lines in the body of the for loop are just a division, and the line immediately after them is a modulo operation. The lines
ulong num = (ulong)origVal * 43827549;
//uint num2 = ((origVal - (uint)(num >> 32)) / 2 + (uint)(num >> 32)) >> 14;
uint num2 = (origVal + (uint)(num >> 32)) / 32768;
translate into
uint valDivN = origVal / n;
and
origVal -= num2 * n;
into
origVal = origVal % n;
So the final code inside the for loop looks like this:
uint valDivN = origVal / n;
origVal = origVal % n;
// RSA encrypt/decrypt here
newVal*= n;
newVal+= origVal;
origVal = valDivN;
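The constant has a simple origin: 2^32 + 43827549 equals 2^47 / 32437 rounded up, so multiplying by it and shifting right by 47 bits is a division by n. The following C sketch reproduces that trick, using the overflow-safe form from the commented-out line; the function and macro names are made up for illustration.

```c
#include <stdint.h>

#define N_MOD 32437u      /* modulus n from the question */
#define MAGIC 43827549u   /* low 32 bits of ceil(2^47 / 32437) */

/* Quotient x / 32437 computed the way the decompiled code does it. */
uint32_t div_by_n(uint32_t x)
{
    /* high 32 bits of x * MAGIC; never larger than x */
    uint32_t hi = (uint32_t)(((uint64_t)x * MAGIC) >> 32);
    /* equals (x + hi) >> 15 but cannot overflow 32 bits */
    return (((x - hi) >> 1) + hi) >> 14;
}

/* Remainder recovered as x - q * n, as in `origVal -= num2 * n`. */
uint32_t mod_by_n(uint32_t x)
{
    return x - div_by_n(x) * N_MOD;
}
```

Compilers routinely emit this pattern for division by a constant, which would explain why it shows up in the compiled code instead of a plain / and %.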
Analysis
This code splits values by taking the original value modulo n, transforming that remainder, multiplying the running result by n, and adding the transformed remainder to it; the quotient is carried over as the input to the next pass. The lines uint valDivN = origVal / n; and newVal*= n; are inverse operations.
You can think of the input message as having two "boxes". After the loop has run, the transformed values end up in opposite "boxes". When the message is decrypted, the two values in the "boxes" are reverse-transformed and put back in their original spots.
The divisor is n in order to keep the value being encrypted/decrypted under n, since RSA can only encrypt values smaller than n. There is no possibility of the wrong value being decrypted, as the code unpacks the message and extracts the part that should be decrypted before decrypting it. The loop runs only twice because there is no chance of the quotient exceeding the size of an int (since the input is an int).
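The "boxes" can be made concrete with a small C sketch that runs the same two-pass loop with the RSA step left out (an identity transform). The function name is hypothetical, and the sketch assumes the input is below n*n so that two base-n digits are enough.

```c
#include <stdint.h>

/* Rewrites x = q*n + r as r*n + q, mirroring the two-pass loop with
   the RSA step omitted. Assumes x < n*n and that n*n fits in 32 bits. */
uint32_t swap_digits(uint32_t x, uint32_t n)
{
    uint32_t out = 0;
    for (int i = 0; i < 2; i++) {
        uint32_t digit = x % n;   /* the part that would be encrypted */
        x /= n;                   /* the rest goes to the next pass */
        out = out * n + digit;    /* append the digit to the result */
    }
    return out;
}
```

With n = 32437, the value 12345 becomes 12345 * 32437 = 400434765, and applying the function a second time returns 12345, which matches the observation that the operation looks symmetric.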

dot product using cblas is slow

I want to calculate the product A^T*A (A is a 2000x1000 matrix). I also only want to compute the upper triangular part. In the inner loop I have to compute the dot product of two vectors.
Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM) i7 CPU M620 @ 2.67GHz, 1.92GB RAM)
The problem is caused essentially by the matrix size, not by ddot. Your matrices are so large that they do not fit in cache memory. The solution is to rearrange the three nested loops so that as much work as possible is done with a line while it is in cache, which reduces cache refreshes. A model implementation follows for both the ddot and a daxpy approach. On my computer the ratio of the run times (ddot to daxpy) was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that need for cache refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t1 = clock();
printf("Time for DOT : %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
t0 = clock();
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
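The effect of the loop order can also be seen without BLAS. Below is a minimal plain-C sketch of both orderings for a row-major matrix, restricted to the upper triangle as the question asks. The function names and the flat array layout are assumptions for illustration, not part of the program above.

```c
#include <stddef.h>

/* "Row times column": every dot product walks two columns of a,
   jumping `cols` doubles per step. Writes only the upper triangle of c. */
void ata_dot(const double *a, size_t rows, size_t cols, double *c)
{
    for (size_t i = 0; i < cols; i++)
        for (size_t j = i; j < cols; j++) {
            double s = 0.0;
            for (size_t k = 0; k < rows; k++)
                s += a[k * cols + i] * a[k * cols + j];
            c[i * cols + j] = s;
        }
}

/* Rearranged loops: each row of a is read once, and the innermost loop
   walks contiguous memory in both a and c. The lower triangle stays 0. */
void ata_axpy(const double *a, size_t rows, size_t cols, double *c)
{
    for (size_t i = 0; i < cols * cols; i++)
        c[i] = 0.0;
    for (size_t k = 0; k < rows; k++) {
        const double *ak = a + k * cols;      /* row k of a */
        for (size_t i = 0; i < cols; i++) {
            double aki = ak[i];
            double *ci = c + i * cols;        /* row i of c */
            for (size_t j = i; j < cols; j++)
                ci[j] += aki * ak[j];
        }
    }
}
```

Both functions produce the same upper triangle; how much faster the second one runs depends on the matrix size, the cache sizes and the compiler, so it is worth timing on the target machine.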
The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
i.e., just a loop unrolled to a stride of 5.
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
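double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2); /* largest even count not exceeding n */
__m128d dtemp;
if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
/* odd n: seed the accumulator with the product of the last elements */
dtemp = _mm_set_sd(x[n-1] * y[n-1]);
}
for(int i=0; i<n2; i+=2) {
/* unaligned loads, so x and y need not be 16-byte aligned */
__m128d x1 = _mm_loadu_pd(x+i);
__m128d y1 = _mm_loadu_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
/* sum the two lanes of the accumulator */
_mm_storeu_pd(&result[0],dtemp);
return result[0] + result[1];
}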
(not tested, never been compiled, buyer beware).
This may or may not be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
If you're not using SSE2 intrinsics, or you're using a data type that may not benefit from them, you can try transposing the matrix for an easy performance improvement in larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about one order of magnitude faster than the three-loop code (but also one order of magnitude slower than a cblas_?gemm call) for single-precision floats and 2K by 2K matrices. (I'm using Intel MKL.)

Separate signed int into bytes in NXC

Is there any way to convert a signed integer into an array of bytes in NXC? I can't use explicit type casting or pointers either, due to language limitations.
I've tried:
for(unsigned long i = 1; i <= 2; i++)
{
MM_mem[id.idx] = (val & (0xFF << ((2 - i) * 8))) >> ((2 - i) * 8);
id.idx++;
}
But it fails.
EDIT: This works... It just wasn't downloading. I've wasted about an hour trying to figure it out. >_>
EDIT: In NXC, >> is an arithmetic shift. int is a signed 16-bit integer type. A byte is the same thing as unsigned char.
NXC is 'Not eXactly C', a relative of C, but distinctly different from C.
How about
unsigned char b[4];
b[0] = (x & 0xFF000000) >> 24;
b[1] = (x & 0x00FF0000) >> 16;
b[2] = (x & 0x0000FF00) >> 8;
b[3] = x & 0xFF;
The best way to do this in NXC with the opcodes available in the underlying VM is to use FlattenVar to convert any type into a string (aka byte array with a null added at the end). It results in a single VM opcode operation where any of the above options which use shifts and logical ANDs and array operations will require dozens of lines of assembly language.
task main()
{
int x = Random(); // 16 bit random number - could be negative
string data;
data = FlattenVar(x); // convert type to byte array with trailing null
NumOut(0, LCD_LINE1, x);
for (int i=0; i < ArrayLen(data)-1; i++)
{
#ifdef __ENHANCED_FIRMWARE
TextOut(0, LCD_LINE2-8*i, FormatNum("0x%2.2x", data[i]));
#else
NumOut(0, LCD_LINE2-8*i, data[i]);
#endif
}
Wait(SEC_4);
}
The best way to get help with LEGO MINDSTORMS and the NXT and Not eXactly C is via the mindboards forums at http://forums.mindboards.net/
Question originally tagged c; this answer may not be applicable to Not eXactly C.
What is the problem with this:
int value;
char bytes[sizeof(int)];
bytes[0] = (value >> 0) & 0xFF;
bytes[1] = (value >> 8) & 0xFF;
bytes[2] = (value >> 16) & 0xFF;
bytes[3] = (value >> 24) & 0xFF;
You can regard it as an unrolled loop. The shift by zero could be omitted; the optimizer would certainly do so. Even though the result of right-shifting a negative value is implementation-defined in C, there is no problem in practice (assuming a 32-bit int): whether the shift is arithmetic or logical, the mask only keeps bits that came from the original value.
This code gives the bytes in a little-endian order - the least-significant byte is in bytes[0]. Clearly, big-endian order is achieved by:
int value;
char bytes[sizeof(int)];
bytes[3] = (value >> 0) & 0xFF;
bytes[2] = (value >> 8) & 0xFF;
bytes[1] = (value >> 16) & 0xFF;
bytes[0] = (value >> 24) & 0xFF;
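Since an NXC int is 16 bits wide, a two-byte variant may be closer to what the question needs. Here is a minimal sketch in standard C (not NXC) that goes through an unsigned type so that no signed shift is involved; the function names are hypothetical.

```c
#include <stdint.h>

/* Store a signed 16-bit value as two bytes, most significant first. */
void split_be16(int16_t value, uint8_t out[2])
{
    uint16_t u = (uint16_t)value;     /* well-defined: wraps modulo 2^16 */
    out[0] = (uint8_t)(u >> 8);
    out[1] = (uint8_t)(u & 0xFF);
}

/* Rebuild the signed value without an implementation-defined cast. */
int16_t join_be16(const uint8_t in[2])
{
    int32_t u = ((int32_t)in[0] << 8) | in[1];
    return (int16_t)(u >= 0x8000 ? u - 0x10000 : u);
}
```

For example, -2 is stored as the bytes 0xFF, 0xFE, and join_be16 turns those two bytes back into -2.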