nettle curve 25519 result of base point * 1 - gmp

I am using GNU nettle library. I have the following code:
#include <nettle/curve25519.h>
uint8_t result[32], one[32];
for(auto &i : one) i = 0;
one[31] = 1;
curve25519_mul_g(result, one);
In the code, I multiplied 1 with the base point. The base point's x-coordinate is 9, so I would expect the result to be 9.
But instead, it gives me this number:
0xfd3384e132ad02a56c78f45547ee40038dc79002b90d29ed90e08eee762ae715.
Why does this code not generate 9?

Curve25519 clamps some bits of the scalar to 1 or 0. Specifically in Curve25519 scalar multiplication:
the lowest 3 bits are set to 0, to ensure the output point is only in the large subgroup, and
bit 2^255 is set to 0 and bit 2^254 (the highest remaining bit) is set to one, to make sure the implementer does not skip any Montgomery-ladder steps.
After this clamping operation, the scalar multiplication algorithm is executed. So in Curve25519, a point cannot be multiplied by 1.
In your case, however, there is even more going on. nettle uses a little-endian convention in their code. That is, when you execute
one[31] = 1;
you are actually setting the 2^248 bit to one, not the 2^0 bit. Then the clamped value will become k = 2^254 + 2^248.
The computation [2^254 + 2^248] * (9 : 1) results in 0x15e72a76ee8ee090ed290db90290c78d0340ee4755f4786ca502ad32e18433fd (big endian), which corresponds with your observation.
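The clamping and the byte-order effect described above can be checked with a small sketch. It does not call nettle and does not perform the curve multiplication; the helper name clamp_curve25519 is made up for illustration and only reproduces the bit manipulation on the scalar, plus the byte reversal between the two hex strings quoted in this thread.

```python
def clamp_curve25519(scalar_bytes: bytes) -> int:
    """Decode a 32-byte little-endian scalar and apply Curve25519 clamping."""
    k = int.from_bytes(scalar_bytes, "little")
    k &= ~7                  # clear the lowest 3 bits
    k &= (1 << 255) - 1      # clear bit 255
    k |= 1 << 254            # set bit 254
    return k

one = bytearray(32)
one[31] = 1                  # same assignment as in the question
k = clamp_curve25519(bytes(one))
print(k == 2**254 + 2**248)

# Hex string from the question, and the same bytes in reverse order
question_hex = "fd3384e132ad02a56c78f45547ee40038dc79002b90d29ed90e08eee762ae715"
reversed_hex = bytes.fromhex(question_hex)[::-1].hex()
print(reversed_hex)
```

Setting one[0] = 1 instead would put the bit at 2^0, but clamping would then clear it, so the effective scalar would be 2^254 rather than 1.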

Time complexity and integer inputs

I came across a question asking to describe the computational complexity in Big O of the following code:
i = 1;
while(i < N) {
i = i * 2;
}
I found this Stack Overflow question asking for the answer, with the most voted answer saying it is Log2(N).
On first thought that answer looks correct; however, I remember learning about pseudo-polynomial runtimes, and how computational complexity measures difficulty with respect to the length of the input rather than its value.
So for integer inputs, the complexity should be in terms of the number of bits in the input.
Therefore, shouldn't this function be O(N)? Because every iteration of the loop increases the number of bits in i by 1, until it reaches around the same bits as N.
This code might be found in a function like the one below:
function FindNextPowerOfTwo(N) {
i = 1;
while(i < N) {
i = i * 2;
}
return i;
}
Here, the input can be thought of as a k-bit unsigned integer, which we might as well imagine as a string of k bits. The input size is therefore k = floor(log_2(N)) + 1 bits.
The assignment i = 1 should be interpreted as creating a new bit string and assigning it the length-one bit string 1. This is a constant time operation.
The loop condition i < N compares the two bit strings to see which represents the larger number. If implemented intelligently, this will take time proportional to the length of the shorter of the two bit strings which will always be i. As we will see, the length of i's bit string begins at 1 and increases by 1 until it is greater than or equal to the length of N's bit string, k. When N is not a power of two, the length of i's bit string will reach k + 1. Thus, the time taken by evaluating the condition is proportional to 1 + 2 + ... + (k + 1) = (k + 1)(k + 2)/2 = O(k^2) in the worst case.
Inside the loop, we multiply i by two over and over. The complexity of this operation depends on how multiplication is to be interpreted. Certainly, it is possible to represent our bit strings in such a way that we could intelligently multiply by two by performing a bit shift and inserting a zero on the end. This could be made to be a constant-time operation. If we are oblivious to this optimization and perform standard long multiplication, we scan i's bit string once to write out a row of 0s and again to write out i with an extra 0, and then we perform regular addition with carry by scanning both of these strings. The time taken by each step here is proportional to the length of i's bit string (really, say that plus one), so the whole thing is proportional to i's bit-string length. Since the bit-string length of i assumes values 1, 2, ..., (k + 1), the total time is 2 + 3 + ... + (k + 2) = (k + 2)(k + 3)/2 - 1 = O(k^2).
Returning i is a constant time operation.
Taking everything together, the runtime is bounded from above and from below by functions of the form c * k^2, and so a bound on the worst-case complexity is Theta(k^2) = Theta(log(N)^2).
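To make the counting argument concrete, here is a rough sketch that runs the same loop and charges each comparison i < N a cost equal to the bit length of i. The function name and the cost model are assumptions taken from the analysis above, not measurements of any real machine.

```python
def find_next_power_of_two(n: int):
    """Return (result, iterations, compare_cost) for the doubling loop.

    compare_cost adds up the bit length of i at every evaluation of i < n,
    mirroring the 1 + 2 + ... term in the analysis above.
    """
    i = 1
    iterations = 0
    compare_cost = i.bit_length()      # first evaluation of i < n
    while i < n:
        i *= 2
        iterations += 1
        compare_cost += i.bit_length() # next evaluation of i < n
    return i, iterations, compare_cost

for n in (5, 8, 1000):
    k = n.bit_length()                 # number of bits in the input
    print(n, k, find_next_power_of_two(n))
```

For N = 1000 (k = 10 bits) the loop runs 10 times, while the summed comparison cost is 1 + 2 + ... + 11 = 66 = (k + 1)(k + 2)/2, which is the quadratic term in k.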
In the given example, you are not increasing the value of i by 1 but doubling it every time, so it moves towards N twice as fast at each step. By multiplying it by two you are halving the size of the search space (between i and N); i.e., reducing the input space by a factor of 2. Thus the complexity of your program is O(log_2(N)).
If by chance you'd be doing -
i = i * 3;
The complexity of your program would be log_3 (N).
It depends on an important question: "Is multiplication a constant-time operation?"
In the real world it is usually considered constant, because you have fixed 32- or 64-bit numbers and multiplying them always takes the same (= constant) time.
On the other hand, you then have the limitation that N fits in 32/64 bits (or whatever other width you use).
In theory, where you do not consider multiplication a constant-time operation, or for some special algorithms where N can grow too large to ignore the cost of multiplication, you are right: you have to start thinking about the complexity of multiplying.
The complexity of multiplying by a constant (in this case 2): you have to go through each bit each time, and you have log_2(N) bits.
And you have to do this log_2(N) times before you reach N.
This ends up with a complexity of log_2(N) * log_2(N) = O(log_2^2(N)).
PS: Akash has a good point that multiplying by 2 can be written as a constant-time operation, because the only thing you need to do in binary is "add a zero" (similar to multiplying by 10 in "human readable" format, where you just add a zero: 4333*10 = 43330).
However, if multiplying is not that simple (you have to go through all the bits), the previous answer is correct.

Find global maximum in the least number of computations

Let's say I have a function f defined on the interval [0,1], which is smooth and increases up to some point a, after which it starts decreasing. I have a grid x[i] on this interval, e.g. with a constant step size of dx = 0.01, and I would like to find which of those points has the highest value, using the smallest number of evaluations of f in the worst-case scenario. I think I can do much better than exhaustive search by applying something inspired by gradient-like methods. Any ideas? I was thinking of something like a binary search perhaps, or parabolic methods.
This is a bisection-like method I coded:
from math import floor

def optimize(f, a, b, fa, fb, dx):
    # fa and fb are the already-computed values f(a) and f(b)
    if b - a <= dx:
        return a if fa > fb else b
    else:
        m1 = 0.5*(a + b)
        m1 = _round(m1, a, dx)          # snap the midpoint down onto the grid
        fm1 = fa if m1 == a else f(m1)
        m2 = m1 + dx                    # grid neighbour to the right of m1
        fm2 = fb if m2 == b else f(m2)
        if fm2 >= fm1:
            return optimize(f, m2, b, fm2, fb, dx)
        else:
            return optimize(f, a, m1, fa, fm1, dx)

def _round(x, a, dx, right = False):
    return a + dx*(floor((x - a)/dx) + right)
The idea is: find the middle of the interval and compute m1 and m2, the grid points to the left and to the right of it. If the function is increasing there, go for the right interval and do the same; otherwise go for the left. Whenever the interval is too small, just compare the values at the ends. However, this algorithm still does not use the magnitude of the derivatives at the points I computed.
Such a function is called unimodal.
Without computing the derivatives, you can work by
finding where the deltas x[i+1]-x[i] change sign, by dichotomy (the deltas are positive then negative after the maximum); this takes Log2(n) comparisons; this approach is very close to what you describe;
adapting the Golden section method to the discrete case; it takes Logφ(n) comparisons (φ~1.618).
Apparently, the Golden section is more costly, as φ<2, but actually the dichotomic search takes two function evaluations at a time, hence 2Log2(n)=Log√2(n) .
One can show that this is optimal, i.e. you can't go faster than O(Log(n)) for an arbitrary unimodal function.
If your function is very regular, the deltas will vary smoothly. You can think of interpolation search, which tries to better predict the searched position by linear interpolation rather than simple halving. In favorable conditions, it can reach O(Log(Log(n))) performance. I don't know of an adaptation of this principle to the Golden search.
Actually, linear interpolation on the deltas is very close to parabolic interpolation on the function values. The latter approach might be the best for you, but you need to be careful about the corner cases.
If derivatives are allowed, you can use any root solving method on the first derivative, knowing that there is an isolated zero in the given interval.
If only the first derivative is available, use regula falsi. If the second derivative is possible as well, you may consider Newton, but prefer a safe bracketing method.
I guess that the benefits of these approaches (superlinear and quadratic convergence) are made a little useless by the fact that you are working on a grid.
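As a rough illustration of the dichotomic search on the sign of the deltas, here is a minimal sketch that works on grid indices and caches function values so that no point is evaluated twice. The name argmax_unimodal is hypothetical, and the sketch assumes f is strictly increasing up to the maximum and strictly decreasing afterwards.

```python
def argmax_unimodal(f, n):
    """Return (index of the maximum of f over range(n), number of evaluations)."""
    cache = {}

    def g(i):
        # Evaluate f at most once per grid index
        if i not in cache:
            cache[i] = f(i)
        return cache[i]

    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if g(mid + 1) > g(mid):   # delta is positive: the maximum is to the right
            lo = mid + 1
        else:                     # delta is negative: the maximum is at mid or to the left
            hi = mid
    return lo, len(cache)

# Example: 101 grid points, peak at index 37
best, evaluations = argmax_unimodal(lambda i: -(i - 37) ** 2, 101)
print(best, evaluations)
```

Each halving step costs at most two evaluations, so for 101 grid points the search stays within 2 * 7 = 14 evaluations, compared with 101 for an exhaustive scan.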
DISCLAIMER: I haven't tested the code. Take this as an "inspiration".
Let's say you have the following 11 points
x,f(x) = (0,3),(1,7),(2,9),(3,11),(4,13),(5,14),(6,16),(7,5),(8,3),(9,1),(10,-1)
You can do something inspired by the bisection method:
a = 0, f(a) = 3 | b = 10, f(b) = -1 | c = (0+10)/2 = 5, f(c) = 14
From here you can see that the function is still increasing at c (f(5) > f(4)), so there is no need to search the interval [a,c[ for the maximum. The maximum has to be in the interval [c,b]. So at the next iteration you change the value of a s.t. a = c:
a = 5, f(a) = 14 | b = 10, f(b) = -1 | c = (5+10)/2 = 7 (integer division), f(c) = 5
Now the function is decreasing at c (f(7) < f(6)), so the maximum has to be in [a,c] and b is moved to the left, b = c.
You can iterate the process until two of a, b and c coincide.
Here is the code that implements this idea:
#include <stdio.h>

#define STEP 0.01
#define SIZE 100   /* number of grid points, 1/STEP */

int main(void) {
    double vals[SIZE];
    /* sample a unimodal test function on the grid */
    for (int i = 0; i < SIZE; ++i) {
        double x = i * STEP;
        vals[i] = -(x*x*x*x - 0.6*(x*x));
    }
    for (int i = 0; i < SIZE; ++i) {
        printf("%f ", vals[i]);
    }
    printf("\n");

    int a = 0, b = SIZE - 1;
    int c = (a + b) / 2;
    while (a != b && b != c && a != c) {
        printf("%i %i %i - %f %f %f\n", a, c, b, vals[a], vals[c], vals[b]);
        if (vals[c] - vals[c-1] > 0) { /* is the function increasing at c? */
            a = c;
        } else {
            b = c;
        }
        c = (a + b) / 2;
    }
    /* a and b are now adjacent: the maximum is the larger of the two */
    int best = (vals[a] >= vals[b]) ? a : b;
    printf("The maximum is %i=%f with %f\n", best, best * STEP, vals[best]);
    return 0;
}
Find the points where the derivative of f(x), df/dx, is 0.
For the derivative you could use a five-point stencil or a similar algorithm.
This should be O(n).
Then fit those points (where the derivative is 0) with a polynomial regression / least-squares regression.
This should also be O(n), assuming all the points are neighbours.
Then find the top of that curve.
This shouldn't be more than O(M), where M is the resolution of trials for the fit function.
While taking the derivative, you could leap by steps of length k until the derivative changes sign.
When the derivative changes sign, take the square root of k and continue in the reverse direction.
When the derivative changes sign again, take the square root of the new k and change direction again.
Example: leap by 100 elements, find a sign change, set leap=10 and reverse direction, next change ==> leap=3 ... then it could be fixed to 1 element per step to find the exact location.
I am assuming that the function evaluation is very costly.
In the special case, that your function could be approximately fitted with a polynomial, you can easily calculate the extrema in least number of function evaluations. And since you know that there is only one maximum, a polynomial of degree 2 (quadratic) might be ideal.
For example: If f(x) can be represented by a polynomial of some known degree, say 2, then, you can evaluate your function at any 3 points and calculate the polynomial coefficients using Newton's difference or Lagrange interpolation method.
Then it's simple to solve for the maximum of this polynomial. For degree 2 you can easily get a closed-form expression for the maximum.
To get the final answer you can then search in the vicinity of the solution.
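For the degree-2 case, the closed form can be sketched as follows. The function name parabola_vertex is made up for illustration; it returns the stationary point of the parabola through three samples, which is the maximum only if that parabola opens downward, and it assumes the three points are not collinear.

```python
def parabola_vertex(x0, f0, x1, f1, x2, f2):
    """x-coordinate of the vertex of the parabola through three points."""
    num = (x1 - x0) ** 2 * (f1 - f2) - (x1 - x2) ** 2 * (f1 - f0)
    den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
    return x1 - 0.5 * num / den   # den == 0 means the points are collinear

def f(x):
    # Stand-in for the expensive function; exactly quadratic here
    return -(x - 0.3) ** 2

xs = (0.0, 0.5, 1.0)
x_peak = parabola_vertex(xs[0], f(xs[0]), xs[1], f(xs[1]), xs[2], f(xs[2]))

dx = 0.01
nearest_grid_index = round(x_peak / dx)   # then search the neighbouring grid points
print(x_peak, nearest_grid_index)
```

With an exactly quadratic f, three evaluations are enough; for a function that is only approximately quadratic, the estimate is a starting point for the local search mentioned above.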

Units conversion on a PIC 18F2431

I have the following conversion between pounds per square inch (PSI) and megapascals (MPa):
psi = MPa*1.45038;
I need the lowest value possible after conversion to be 1 PSI. An example of what I am looking for is:
psi = ((long)MPa*145)/100
Is there any way to optimize this for memory and speed by not using float or long? I will be implementing this conversion on a microcontroller (PIC 18F2431).
You should divide by powers of 2 instead, which is far cheaper than division by any other value. And if the type can't be negative, then use an unsigned type instead. Depending on the type of MPa and its maximum value, you can choose a different denominator to suit your needs. There is no need to cast to a wider type if the multiplication won't overflow.
For example if MPa is of type uint16_t you can do psi = MPa*95052/(1UL << 16); (95052/65536 ≈ 1.450378)
If MPa is not larger than 1024 (2^10), then you can scale by 2^21 without overflowing 32 bits, and thus increase the numerator/denominator for more precision:
psi = MPa*3041667/(1UL << 21);
Edit:
On the PIC 18F2431 int is a 16-bit type. That means 95052 will be of type long and MPa will be promoted to long in the expression. If you don't need much precision then change the scaling to fit in an int/int16_t to avoid dealing with long
In case MPa is not larger than 20, you can divide by 2048, which is the largest power of 2 that is less than or equal to 2^16/20.
psi = MPa*2970/(1U << 11);
Note that the * and / have equal precedence so it'll be evaluated from left to right, and the above equation will be identical to
psi = (MPa*2970)/2048; // = MPa*1.4501953125
so there is no need for such excessive parentheses.
Edit 2:
Unfortunately, if the range of MPa is [0, 2000], then you can only multiply it by at most 32 without overflowing a 16-bit unsigned int. The closest ratio that you can achieve is 23/16 = 1.4375 (the equivalent 46/32 would already overflow, since 2000*46 > 65535), so if you need more precision, there's no way other than using long. Anyway, integer math with long is still a lot faster than floating-point math on the PIC MCU, and costs significantly less code space.
Calculate the largest N such that MPa*1.45038*2^N < 2^32
Calculate the constant value of K = floor(1.45038*2^N) once
Calculate the value of psi = (MPa*K)>>N for every value of MPa
Since 0 <= MPa <= 2000, you must choose N such that 2000*1.45038*2^N < 2^32:
2^N < 2^32/(2000*1.45038)
N < log2(2^32/(2000*1.45038))
N < 20.497
N = 20
Therefore, K = floor(1.45038*2^N) = floor(1.45038*2^20) = 1520833.
So for every value of MPa, you can calculate psi = (MPa*1520833)>>20.
You'll need to make sure that MPa is unsigned (or cast it accordingly).
Using this method will allow you to avoid floating-point operations.
For better accuracy, you can use round instead of floor, giving you K = 1520834.
In this specific case it will work out fine, because 2000*1520834 is smaller than 2^32.
But with a different maximum value of MPa or a different conversion scalar, it might not.
In any case, the difference in the outcome of psi for each value of K is negligible.
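The derivation of N and K can be double-checked off-target with a short script. This is only a desk check in Python, not PIC code: Python integers do not overflow, so the 32-bit bound is asserted explicitly, and the names FACTOR, MAX_MPA, shift and mpa_to_psi are chosen here for illustration.

```python
from fractions import Fraction

FACTOR = Fraction(145038, 100000)   # 1.45038, represented exactly
MAX_MPA = 2000

# Largest shift N such that MAX_MPA * FACTOR * 2**N < 2**32
shift = 0
while MAX_MPA * FACTOR * 2 ** (shift + 1) < 2 ** 32:
    shift += 1

K = int(FACTOR * 2 ** shift)        # floor, since the value is positive

def mpa_to_psi(mpa: int) -> int:
    """Fixed-point conversion: multiply, then shift right."""
    return (mpa * K) >> shift

# Compare against the exact floor(MPa * 1.45038) over the whole input range
worst_error = max(abs(mpa_to_psi(m) - int(m * FACTOR)) for m in range(MAX_MPA + 1))
fits_in_32_bits = MAX_MPA * K < 2 ** 32
print(shift, K, worst_error, fits_in_32_bits)
```

The worst-case difference from the exact truncated result is at most one PSI count over the range 0 to 2000.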
BTW, if you don't mind the additional memory, then you can use a look-up table instead:
Add a pre-calculated global variable const unsigned short lut[2001] = {...}
For every value of MPa, calculate psi = lut[MPa] instead of psi = (MPa*K)>>N
So instead of mul + shift operations, your program will perform a load operation
Please note, however, that whether or not this is more efficient in terms of runtime performance depends on several things, such as the accessibility of the memory segment in which you allocate the look-up table, the architecture of the processor at hand, runtime caching heuristics, etc.
So you will need to apply some profiling on your program in order to decide which approach is better.

Differences between mult and div operations on floating point numbers

Is there any difference in computation precision between these 2 cases:
1) x = y / 1000d;
2) x = y * 0.001d;
Edit: I shouldn't have added the C# tag. The question is only from a 'floating-point' point of view. I don't want to know which is faster; I need to know which case will give me 'better precision'.
No, they're not the same - at least not with C#, using the version I have on my machine (just standard .NET 4.5.1) on my processor - there are enough subtleties involved that I wouldn't like to claim it'll do the same on all machines, or with all languages. This may very well be a language-specific question after all.
Using my DoubleConverter class to show the exact value of a double, and after a few bits of trial and error, here's a C# program which at least on my machine shows a difference:
using System;

class Program
{
    static void Main(string[] args)
    {
        double input = 9;
        double x1 = input / 1000d;
        double x2 = input * 0.001d;
        Console.WriteLine(x1 == x2);
        Console.WriteLine(DoubleConverter.ToExactString(x1));
        Console.WriteLine(DoubleConverter.ToExactString(x2));
    }
}
Output:
False
0.00899999999999999931998839741709161899052560329437255859375
0.009000000000000001054711873393898713402450084686279296875
I can reproduce this in C with the Microsoft C compiler - apologies if it's horrendous C style, but I think it at least demonstrates the differences:
#include <stdio.h>

int main(void) {
    double input = 9;
    double x1 = input / 1000;
    double x2 = input * 0.001;
    printf("%s\r\n", x1 == x2 ? "Same" : "Not same");
    printf("%.18f\r\n", x1);
    printf("%.18f\r\n", x2);
    return 0;
}
Output:
Not same
0.008999999999999999
0.009000000000000001
I haven't looked into the exact details, but it makes sense to me that there is a difference, because dividing by 1000 and multiplying by "the nearest double to 0.001" aren't the same logical operation... because 0.001 can't be exactly represented as a double. The nearest double to 0.001 is actually:
0.001000000000000000020816681711721685132943093776702880859375
... so that's what you end up multiplying by. You're losing information early, and hoping that it corresponds to the same information that you lose otherwise by dividing by 1000. It looks like in some cases it isn't.
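The same experiment can be repeated in Python, whose float is an IEEE 754 double on common platforms (that is an assumption about the platform, not something the language guarantees everywhere). decimal.Decimal applied to a float prints the exact binary value, playing the role that DoubleConverter plays above.

```python
from decimal import Decimal

x1 = 9 / 1000      # one correctly rounded division
x2 = 9 * 0.001     # multiplication by the nearest double to 0.001

same = (x1 == x2)
exact_multiplier = Decimal(0.001)   # exact value of the double nearest to 0.001

print(same)
print(Decimal(x1))
print(Decimal(x2))
print(exact_multiplier)
```

The division rounds once, from the exact quotient. The multiplication rounds twice: once when 0.001 is converted to a double, and again on the product.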
You are programming in base 10, but floating point is base 2. You CAN represent 1000 in base 2 but cannot represent 0.001 in base 2, so you have chosen bad numbers to ask your question. On a computer x/1000 != x*0.001; you might get lucky most of the time with rounding and more precision, but it is not a mathematical identity.
Now maybe that was your question; maybe you wanted to know why x/1000 != x*0.001. The answer to that question is that this is a binary computer and it uses base 2, not base 10. There are conversion problems with 0.001 when going to base 2: you cannot exactly represent that fraction in an IEEE floating-point number.
In base 10 we know that if we have a fraction with a factor of 3 in the denominator (and lacking one in the numerator to cancel it out) we end up with an infinitely repeated pattern, basically we cannot accurately represent that number with a finite set of digits.
1/3 = 0.33333...
Same problem when you try to represent 1/10 in base 2. 10 = 2*5 the 2 is okay 1/2, but the 5 is the real problem 1/5.
1/10th (1/1000 works the same way). Elementary long division:
       0.000110011
       -----------
1010 | 1.000000000
         1010
         ----
          1100
          1010
          ----
            10000
             1010
             ----
              1100
              1010
              ----
                10
We have to keep pulling down zeros until we get 10000. 10 goes into 16 one time, remainder 6; drop the next zero. 10 goes into 12 one time, remainder 2. And we repeat the pattern, so you end up with 0011 repeated forever. Floating point has a fixed number of bits, so we cannot represent an infinite pattern.
Now, if your question has to do with something like whether dividing by 4 is the same as multiplying by 1/4, that is a different question. The answer is that it should be the same: a divide consumes more cycles and/or logic than a multiply, but it works out with the same answer in the end.
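The power-of-two case can be spot-checked in the same way. Since 0.25 is exactly representable in binary, x / 4 and x * 0.25 both round the same exact value, so they should agree. The sample values below are arbitrary choices for illustration, and the check again assumes Python floats are IEEE 754 doubles.

```python
samples = [9.0, 0.1, 3.141592653589793, 1e-300, 1e308, 5e-324]

# Both expressions compute the correctly rounded value of x/4
all_equal = all(x / 4 == x * 0.25 for x in samples)

# Contrast: 0.001 is not exactly representable, so the two forms can differ
thousand_differs = (9.0 / 1000 != 9.0 * 0.001)

print(all_equal, thousand_differs)
```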
Probably not. The compiler (or the JIT) is likely to convert the first case to the second anyway, since multiplication is typically faster than division. You would have to check this by compiling the code (with or without optimizations enabled) and then examining the generated IL with a tool like IL Disassembler or .NET Reflector, and/or examining the native code with a debugger at runtime.
No, there isn't any difference, except if you set a custom rounding mode.
gcc produces ((double)0.001 - (double)1.0/1000) == 0.0e0
When compiler converts 0.001 to binary it divides 1 by 1000. It uses software floating point simulation compatible with target architecture to do this.
For high precision there are long double (80-bit) and software simulation of any precision.
PS I used gcc for 64 bit machine, both sse and x87 FPU.
PPS With some optimizations 1/1000.0 could be more precise on x87 since x87 uses 80-bit internal representation and 1000 == 1000.0. It is true if you use result for next calculations promptly. If you return/write to memory it calculates 80-bit value and then rounds it to 64-bit. But SSE is more common to use for double.

What is a reduction variable? Could anyone give me some examples?

What is a reduction variable?
Could anyone give me some examples?
Here's a simple example in a C-like language of computing the sum of an array:
int x = 0;
for (int i = 0; i < n; i++) {
    x += a[i];
}
In this example,
i is an induction variable - in each iteration it changes by some constant. It can be +1 (as in the above example) or *2 or /3 etc., but the key is that in all the iterations the number is the same.
In other words, in each iteration i_new = i_old op constant, where op is +, *, etc., and neither op nor constant change between iterations.
x is a reduction variable - it accumulates data from one iteration to the next. It always has some initialization (x = 0 in this case), and while the data accumulated can be different in each iteration, the operator remains the same.
In other words, in each iteration x_new = x_old op data, and op remains the same in all iterations (though data may change).
In many languages there's a special syntax for performing something like this - often called a "fold" or "reduce" or "accumulate" (and it has other names) - but in the context of LLVM IR, an induction variable will be represented by a phi node in a loop between a binary operation inside the loop and the initialization value before it.
Commutative* operations in reduction variables (such as addition) are particularly interesting for an optimizing compiler because they appear to show a stronger dependency between iterations than there really is; for instance the above example could be rewritten into a vectorized form - adding, say, 4 numbers at a time, and followed by a small loop to sum the final vector into a single value.
* there are actually more conditions that the reduction variable has to fulfill before a vectorization like this can be applied, but that's really outside the scope here
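The vectorized rewrite mentioned above can be imitated in plain Python to show its shape. This is only a sketch of the idea, not what a compiler emits: four partial sums stand in for the lanes of a vector register, and the function names are made up. Integers are used so that reordering the additions cannot change the result.

```python
def sum_scalar(a):
    """Straightforward reduction: x accumulates one element per iteration."""
    x = 0
    for v in a:
        x += v
    return x

def sum_four_lanes(a):
    """Same reduction with four independent partial sums ('lanes')."""
    lanes = [0, 0, 0, 0]
    main = len(a) - len(a) % 4          # part of the array that fills whole groups of 4
    for i in range(0, main, 4):
        for lane in range(4):           # these four additions are independent of each other
            lanes[lane] += a[i + lane]
    x = 0
    for partial in lanes:               # small loop that sums the final vector
        x += partial
    for v in a[main:]:                  # leftover elements
        x += v
    return x

data = list(range(1, 11))
print(sum_scalar(data), sum_four_lanes(data))
```

With floating-point data the two versions may differ in the last bits, because the additions are performed in a different order; that is one of the extra conditions the footnote alludes to.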