Please explain this code for Merkle–Hellman knapsack cryptosystem? - cryptography

This is the code snippet from a program that implements Merkle–Hellman knapsack cryptosystem.
// Generates keys based on input data size
private void generateKeys(int inputSize)
{
// Generating values for w
// This first value of the private key (w) is set to 1
w.addNode(new BigInteger("1"));
for (int i = 1; i < inputSize; i++)
{
w.addNode(nextSuperIncreasingNumber(w));
}
// Generate value for q
q = nextSuperIncreasingNumber(w);
// Generate value for r
Random random = new Random();
// Generate a value of r such that r and q are coprime
do
{
r = q.subtract(new BigInteger(random.nextInt(1000) + ""));
}
while ((r.compareTo(new BigInteger("0")) > 0) && (q.gcd(r).intValue() != 1));
// Generate b such that b = w * r mod q
for (int i = 0; i < inputSize; i++)
{
b.addNode(w.get(i).getData().multiply(r).mod(q));
}
}
Just tell me what is going on in the following lines:
do
{
r = q.subtract(new BigInteger(random.nextInt(1000) + ""));
}
while ((r.compareTo(new BigInteger("0")) > 0) && (q.gcd(r).intValue() != 1));
(1) Why is random number generated with upper bound 1000?
(2) Why is it subtracted from q?

The code is searching for a value that is co-prime with the already selected value q. In my opinion, it's doing so rather poorly, but you mention it's a simulator? I'm not sure what that means, but maybe it just means the code is quick and dirty rather than slow and secure.
Answering your questions directly:
Why is random number generated with upper bound 1000?
The Merkle-Hellman algorithm does indicate that r should be 'random'. The implementation for doing so is pretty haphazard; that might be what's thrown you off. The code is not technically an algorithm because the loop is not guaranteed to terminate. In theory, the psuedo-random candidate selection of r could be an arbitrarily long sequence of numbers which aren't co-prime to q, resulting in an infinite loop.
The upper bound of 1000 could be to ensure that the chosen r is sufficiently large. In general, large keys are harder to break than small keys, so if q is large, then this code will only find large r.
A more deterministic way to get a random co-prime would be to test each number lower than q, generating a list of co-primes and select one at random. This would probably be more secure, as an attacker knowing that q and r are within 1000 of each other would have a significantly reduced search space.
Why is it subtracted from q?
The subtraction is important because r must be less than q. The Merkle-Hellmen algorithm specifies it that way. I'm not convince that it needs to be that way. The public key is generated by multiplying each element in w by r and taking the modulus q. If r were very large, larger than q, it seems like it would further obfuscate q and each element in w.
The decryption step of Merkle-Hellmen, on the other hand, depends on the modular inverse of each encrypted letter a x r−1 mod q. This operation might be hampered by having r > q; it seems like it could still work out.
However, if nextInt can return 0, that iteration of the loop is a waste as a q and r must be different (gcd(a,a) is just a).
Breaking down the code:
do
Try it at least once. r is probably null or undefined before the method is called.
r = q.subtract(new BigInteger(random.nextInt(1000) + ""));
Find a candidate value that's between q and q - 1000.
while ((r.compareTo(new BigInteger("0")) > 0) && (q.gcd(r).intValue() != 1));
Keep going until you've found an r that is:
Greater than 0 r.compareTo(new BigInteger("0")) > 0, and
Is co-prime with q, q.gcd(r).intValue() != 1. Obviously, a randomly selected number is not guaranteed to be co-prime with another other number, so the randomly generated candidate might not be work for this q.
Does that clear it up? I have to admit that I'm not an expert on Merkle-Hellman.

Related

Not sure whether it's smaller or larger - Big O notation

Could one of you kindly to tell me whether it's smaller or bigger?
Is O(N * logK) bigger than O(N)? I think it is bigger because O(NlogN) is bigger than O(N), the linear one.
Yes, it should increase, unless for some reason K is always one, in which you wouldnt put the 'logK' in O(N*logK) and it would just be O(N) which is obv equal to O(N)
Think of it this way: What is O(N) and O(N*logK) saying?
Well O(N) is saying, for example, that you have something like an array with N elements in it. For each element you are doing an operation that takes constant time, ie adding a number to that element
While O(N*logK) is saying, not only do you need to do an operation for each element, you need to do an operation that takes logK time. Its important to note that K would denote something different than N in this case, for example you could have the array from the O(N) example plus another array with K elements. Heres a code example
public void SomeNLogKOperation(int[] nElements, int[] kElements){
//for each element in nElements, ie O(N)
for(int i = 0; i < nElements.length; i++){
//do operation that takes O(logK) time, now we have O(N*logK)
int val = operationThatTakesLogKTime(nElements[i], kElements)
}
}
public void SomeNOperation(int[] nElements){
//for each element in nElements, ie O(N)
for(int i = 0; i < nElements.length; i++){
//simple operation that takes O(1) time, so we have O(N*1) = O(N)
int val = nElements[i] + 1;
}
}
I absolutely missed you used log(K) in the expression - this answer is invalid if K is not dependent on N and more, less than 1. But the you use O NlogN in the next
sentence so lets go with N log N.
So for N = 1000 O(N) is exactly that.
O(NlogN) is logN more. Usually we are looking at a base 2 log, so O(NlogN) is about 10000.
The difference is not large but very measurable.
For N = 1,000,000
You have O(N) at 1 million
O(NlogN) would sit comfortably at 20 million.
It is helpful to know your logs to common values
8-bit max 255 => log 255 = 8
10 bit max 1024 => log 1024 = 10: Conclude log 1000 is very close to 10.
16 bit 65735 => log 65735 = 16
20 bits max 1024072 = 20 bits very close to 1 million.
This question is not asked in the context of algorithmic time complexity. Only math is required here.
So we are comparing too functions. It all depends on context. What do we know of N and K? If K and N are both free variables that tend to infinity, then yes, O(N * log k) is "bigger" than O(N), in the sense that
N = O(N * log k) but
N * log k ≠ O(N).
However, if K is some constant parameter > 0, then they are the same complexity class.
On the other hand, K could be 0 or negative, in which case we obtain different relationships. So you need to define/provide more context to be able to make this comparison.

Trade off between Linear and Binary Search

I have a list of elements to be searched in a dataset of variable lengths. I have tried binary search and I found it is not always efficient when the objective is to search a list of elements.
I did the following study and conclude that if the number of elements to be searched is less than 5% of the data, binary search is efficient, other wise the Linear search is better.
Below are the details
Number of elements : 100000
Number of elements to be searched: 5000
Number of Iterations (Binary Search) =
log2 (N) x SearchCount=log2 (100000) x 5000=83048
Further increase in the number of search elements lead to more iterations than the linear search.
Any thoughts on this?
I am calling the below function only if the number elements to be searched is less than 5%.
private int SearchIndex(ref List<long> entitylist, ref long[] DataList, int i, int len, ref int listcount)
{
int Start = i;
int End = len-1;
int mid;
while (Start <= End)
{
mid = (Start + End) / 2;
long target = DataList[mid];
if (target == entitylist[listcount])
{
i = mid;
listcount++;
return i;
}
else
{
if (target < entitylist[listcount])
{
Start = mid + 1;
}
if (target > entitylist[listcount])
{
End = mid - 1;
}
}
}
listcount++;
return -1; //if the element in the list is not in the dataset
}
In the code I retun the index rather than the value because, I need to work with Index in the calling function. If i=-1, the calling function resets the value to the previous i and calls the function again with a new element to search.
In your problem you are looking for M values in an N long array, N > M, but M can be quite large.
Usually this can be approached as M independent binary searches (or even with the slight optimization of using the previous result as a starting point): you are going to O(M*log(N)).
However, using the fact that also the M values are sorted, you can find all of them in one pass, with linear search. In this case you are going to have your problem O(N). In fact this is better than O(M*log(N)) for M large.
But you have a third option: since M values are sorted, binary split M too, and every time you find it, you can limit the subsequent searches in the ranges on the left and on the right of the found index.
The first look-up is on all the N values, the second two on (average) N/2, than 4 on N/4 data,.... I think that this scale as O(log(M)*log(N)). Not sure of it, comments welcome!
However here is a test code - I have slightly modified your code, but without altering its functionality.
In case you have M=100000 and N=1000000, the "M binary search approach" takes about 1.8M iterations, that's more that the 1M needed to scan linearly the N values. But with what I suggest it takes just 272K iterations.
Even in case the M values are very "collapsed" (eg, they are consecutive), and the linear search is in the best condition (100K iterations would be enough to get all of them, see the comments in the code), the algorithm performs very well.

Calculate function time complexity

I am trying to calculate the time complexity of this function
Code
int Almacen::poner_items(id_sala s, id_producto p, int cantidad){
it_prod r = productos.find(p);
if(r != productos.end()) {
int n = salas[s - 1].size();
int m = salas[s - 1][0].size();
for(int i = n - 1; i >= 0 && cantidad > 0; --i) {
for(int j = 0; j < m && cantidad > 0; ++j) {
if(salas[s - 1][i][j] == "NULL") {
salas[s - 1][i][j] = p;
r->second += 1;
--cantidad;
}
}
}
}
else {
displayError();
return -1;
}
return cantidad;
}
the variable productos is a std::map and its find method has a time complexity of Olog(n) and other variable salas is a std::vector.
I calculated the time and I found that it was log(n) + nm but am not sure if it is the correct expression or I should leave it as nm because it is the worst or if I whould use n² only.
Thanks
The overall function is O(nm). Big-O notation is all about "in the limit of large values" (and ignores constant factors). "Small" overheads (like an O(log n) lookup, or even an O(n log n) sort) are ignored.
Actually, the O(n log n) sort case is a bit more complex. If you expect m to be typically the same sort of size as n, then O(nm + nlogn) == O(nm), if you expect n ≫ m, then O(nm + nlogn) == O(nlogn).
Incidentally, this is not a question about C++.
In general when using big O notation, you only leave the most dominant term when taking all variables to infinity.
n by itself is much larger than log n at infinity, so even without m you can (and generally should) drop the log n term, so O(nm) looks fine to me.
In non-theoretical use cases, it is sometimes important to understand the actual complexity (for non-infinite inputs), since sometimes algorithms that are slow at infinity can produce better results for shorter inputs (there are some examples where O(1) algorithms have such a terrible constant that an exponential algorithm does better in real life). quick sort is considered a practical example of an O(n^2) algorithm that often does better than it's O(n log n) counterparts.
Read about "Big O Notation" for more info.
let
k = productos.size()
n = salas[s - 1].size()
m = salas[s - 1][0].size()
your algorithm is O(log(k) + nm). You need to use a distinct name for each independent variable
Now it might be the case that there is a relation between k, n, m and you can re-label with a reduced set of variables, but that is not discernible from your code, you need to know about the data.
It may also be the case that some of these terms won't grow large, in which case they are actually constants, i.e. O(1).
E.g. you may know k << n, k << m and n ~= m , which allows you describe it as O(n^2)

Time complexity of for loops, I cannot really understand a thing

So these are the for loops that I have to find the time complexity, but I am not really clearly understood how to calculate.
for (int i = n; i > 1; i /= 3) {
for (int j = 0; j < n; j += 2) {
... ...
}
for (int k = 2; k < n; k = (k * k) {
...
}
For the first line, (int i = n; i > 1; i /= 3), keeps diving i by 3 and if i is less than 1 then the loop stops there, right?
But what is the time complexity of that? I think it is n, but I am not really sure.
The reason why I am thinking it is n is, if I assume that n is 30 then i will be like 30, 10, 3, 1 then the loop stops. It runs n times, doesn't it?
And for the last for loop, I think its time complexity is also n because what it does is
k starts as 2 and keeps multiplying itself to itself until k is greater than n.
So if n is 20, k will be like 2, 4, 16 then stop. It runs n times too.
I don't really think I am understanding this kind of questions because time complexity can be log(n) or n^2 or etc but all I see is n.
I don't really know when it comes to log or square. Or anything else.
Every for loop runs n times, I think. How can log or square be involved?
Can anyone help me understanding this? Please.
Since all three loops are independent of each other, we can analyse them separately and multiply the results at the end.
1. i loop
A classic logarithmic loop. There are countless examples on SO, this being a similar one. Using the result given on that page and replacing the division constant:
The exact number of times that this loop will execute is ceil(log3(n)).
2. j loop
As you correctly figured, this runs O(n / 2) times;
The exact number is floor(n / 2).
3. k loop
Another classic known result - the log-log loop. The code just happens to be an exact replicate of this SO post;
The exact number is ceil(log2(log2(n)))
Combining the above steps, the total time complexity is given by
Note that the j-loop overshadows the k-loop.
Numerical tests for confirmation
JavaScript code:
T = function(n) {
var m = 0;
for (var i = n; i > 1; i /= 3) {
for (var j = 0; j < n; j += 2)
m++;
for (var k = 2; k < n; k = k * k)
m++;
}
return m;
}
M = function(n) {
return ceil(log(n)/log(3)) * (floor(n/2) + ceil(log2(log2(n))));
}
M(n) is what the math predicts that T(n) will exactly be (the number of inner loop executions):
n T(n) M(n)
-----------------------
100000 550055 550055
105000 577555 577555
110000 605055 605055
115000 632555 632555
120000 660055 660055
125000 687555 687555
130000 715055 715055
135000 742555 742555
140000 770055 770055
145000 797555 797555
150000 825055 825055
M(n) matches T(n) perfectly as expected. A plot of T(n) against n log n (the predicted time complexity):
I'd say that is a convincing straight line.
tl;dr; I describe a couple of examples first, I analyze the complexity of the stated problem of OP at the bottom of this post
In short, the big O notation tells you something about how a program is going to perform if you scale the input.
Imagine a program (P0) that counts to 100. No matter how often you run the program, it's going to count to 100 as fast each time (give or take). Obviously right?
Now imagine a program (P1) that counts to a number that is variable, i.e. it takes a number as an input to which it counts. We call this variable n. Now each time P1 runs, the performance of P1 is dependent on the size of n. If we make n a 100, P1 will run very quickly. If we make n equal to a googleplex, it's going to take a little longer.
Basically, the performance of P1 is dependent on how big n is, and this is what we mean when we say that P1 has time-complexity O(n).
Now imagine a program (P2) where we count to the square root of n, rather than to itself. Clearly the performance of P2 is going to be worse than P1, because the number to which they count differs immensely (especially for larger n's (= scaling)). You'll know by intuition that P2's time-complexity is equal to O(n^2) if P1's complexity is equal to O(n).
Now consider a program (P3) that looks like this:
var length= input.length;
for(var i = 0; i < length; i++) {
for (var j = 0; j < length; j++) {
Console.WriteLine($"Product is {input[i] * input[j]}");
}
}
There's no n to be found here, but as you might realise, this program still depends on an input called input here. Simply because the program depends on some kind of input, we declare this input as n if we talk about time-complexity. If a program takes multiple inputs, we simply call those different names so that a time-complexity could be expressed as O(n * n2 + m * n3) where this hypothetical program would take 4 inputs.
For P3, we can discover it's time-complexity by first analyzing the number of different inputs, and then by analyzing in what way it's performance depends on the input.
P3 has 3 variables that it's using, called length, i and j. The first line of code does a simple assignment, which' performance is not dependent on any input, meaning the time-complexity of that line of code is equal to O(1) meaning constant time.
The second line of code is a for loop, implying we're going to do something that might depend on the length of something. And indeed we can tell that this first for loop (and everything in it) will be executed length times. If we increase the size of length, this line of code will do linearly more, thus this line of code's time complexity is O(length) (called linear time).
The next line of code will take O(length) time again, following the same logic as before, however since we are executing this every time execute the for loop around it, the time complexity will be multiplied by it: which results in O(length) * O(length) = O(length^2).
The insides of the second for loop do not depend on the size of the input (even though the input is necessary) because indexing on the input (for arrays!!) will not become slower if we increase the size of the input. This means that the insides will be constant time = O(1). Since this runs in side of the other for loop, we again have to multiply it to obtain the total time complexity of the nested lines of code: `outside for-loops * current block of code = O(length^2) * O(1) = O(length^2).
The total time-complexity of the program is just the sum of everything we've calculated: O(1) + O(length^2) = O(length^2) = O(n^2). The first line of code was O(1) and the for loops were analyzed to be O(length^2). You will notice 2 things:
We rename length to n: We do this because we express
time-complexity based on generic parameters and not on the ones that
happen to live within the program.
We removed O(1) from the equation. We do this because we're only
interested in the biggest terms (= fastest growing). Since O(n^2)
is way 'bigger' than O(1), the time-complexity is defined equal to
it (this only works like that for terms (e.g. split by +), not for
factors (e.g. split by *).
OP's problem
Now we can consider your program (P4) which is a little trickier because the variables within the program are defined a little cloudier than the ones in my examples.
for (int i = n; i > 1; i /= 3) {
for (int j = 0; j < n; j += 2) {
... ...
}
for (int k = 2; k < n; k = (k * k) {
...
}
}
If we analyze we can say this:
The first line of code is executed O(cbrt(3)) times where cbrt is the cubic root of it's input. Since i is divided by 3 every loop, the cubic root of n is the number of times the loop needs to be executed before i is smaller or equal to 1.
The second for loop is linear in time because j is executed
O(n / 2) times because it is increased by 2 rather than 1 which
would be 'normal'. Since we know that O(n/2) = O(n), we can say
that this for loop is executed O(cbrt(3)) * O(n) = O(n * cbrt(n)) times (first for * the nested for).
The third for is also nested in the first for, but since it is not nested in the second for, we're not going to multiply it by the second one (obviously because it is only executed each time the first for is executed). Here, k is bound by n, however since it is increased by a factor of itself each time, we cannot say it is linear, i.e. it's increase is defined by a variable rather than by a constant. Since we increase k by a factor of itself (we square it), it will reach n in 2log(n) steps. Deducing this is easy if you understand how log works, if you don't get this you need to understand that first. In any case, since we analyze that this for loop will be run O(2log(n)) time, the total complexity of the third for is O(cbrt(3)) * O(2log(n)) = O(cbrt(n) *2log(n))
The total time-complexity of the program is now calculated by the sum of the different sub-timecomplexities: O(n * cbrt(n)) + O(cbrt(n) *2log(n))
As we saw before, we only care about the fastest growing term if we talk about big O notation, so we say that the time-complexity of your program is equal to O(n * cbrt(n)).

Get the most occuring number amongst several integers without using arrays

DISCLAIMER: Rather theoretical question here, not looking for a correct answere, just asking for some inspiration!
Consider this:
A function is called repetitively and returns integers based on seeds (the same seed returns the same integer). Your task is to find out which integer is returned most often. Easy enough, right?
But: You are not allowed to use arrays or fields to store return values of said function!
Example:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
for(int i = 0; i < iterations; i++)
{
int result = getNumberFromSeed(i);
int occurencesOfResult = magic();
if(occurencesOfResult > occurencesOfMostFrequentNumber)
{
mostFrequentNumber = result;
occurencesOfMostFrequentNumber = occurencesOfResult;
}
}
If getNumberFromSeed() returns 2,1,5,18,5,6 and 5 then mostFrequentNumber should be 5 and occurencesOfMostFrequentNumber should be 3 because 5 is returned 3 times.
I know this could easily be solved using a two-dimentional list to store results and occurences. But imagine for a minute that you can not use any kind of arrays, lists, dictionaries etc. (Maybe because the system that is running the code has such a limited memory, that you cannot store enough integers at once or because your prehistoric programming language has no concept of collections).
How would you find mostFrequentNumber and occurencesOfMostFrequentNumber? What does magic() do?? (Of cause you do not have to stick to the example code. Any ideas are welcome!)
EDIT: I should add that the integers returned by getNumber() should be calculated using a seed, so the same seed returns the same integer (i.e. int result = getNumber(5); this would always assign the same value to result)
Make an hypothesis: Assume that the distribution of integers is, e.g., Normal.
Start simple. Have two variables
. N the number of elements read so far
. M1 the average of said elements.
Initialize both variables to 0.
Every time you read a new value x update N to be N + 1 and M1 to be M1 + (x - M1)/N.
At the end M1 will equal the average of all values. If the distribution was Normal this value will have a high frequency.
Now improve the above. Add a third variable:
M2 the average of all (x - M1)^2 for all values of xread so far.
Initialize M2 to 0. Now get a small memory of say 10 elements or so. For every new value x that you read update N and M1 as above and M2 as:
M2 := M2 + (x - M1)^2 * (N - 1) / N
At every step M2 is the variance of the distribution and sqrt(M2) its standard deviation.
As you proceed remember the frequencies of only the values read so far whose distances to M1 are less than sqrt(M2). This requires the use of some additional array, however, the array will be very short compared to the high number of iterations you will run. This modification will allow you to guess better the most frequent value instead of simply answering the mean (or average) as above.
UPDATE
Given that this is about insights for inspiration there is plenty of room for considering and adapting the approach I've proposed to any particular situation. Here are some thoughts
When I say assume that the distribution is Normal you should think of it as: Given that the problem has no solution, let's see if there is some qualitative information I can use to decide what kind of distribution would the data have. Given that the algorithm is intended to find the most frequent number, it should be fine to assume that the distribution is not uniform. Let's try with Normal, LogNormal, etc. to see what can be found out (more on this below.)
If the game completely disallows the use of any array, then fine, keep track of only, say 10 numbers. This would allow you to count the occurrences of the 10 best candidates, which will give more confidence to your answer. In doing this choose your candidates around the theoretical most likely value according to the distribution of your hypothesis.
You cannot use arrays but perhaps you can read the sequence of numbers two or three times, not just once. In that case you can read it once to check whether you hypothesis about its distribution is good nor bad. For instance, if you compute not just the variance but the skewness and the kurtosis you will have more elements to check your hypothesis. For instance, if the first reading indicates that there is some bias, you could use a LogNormal distribution instead, etc.
Finally, in addition to providing the approximate answer you would be able to use the information collected during the reading to estimate an interval of confidence around your answer.
Alright, I found a decent solution myself:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
int maxNumber = -2147483647;
int minNumber = 2147483647;
//Step 1: Find the largest and smallest number that _can_ occur
for(int i = 0; i < iterations; i++)
{
int result = getNumberFromSeed(i);
if(result > maxNumber)
{
maxNumber = result;
}
if(result < minNumber)
{
minNumber = result;
}
}
//Step 2: for each possible number between minNumber and maxNumber, count occurences
for(int thisNumber = minNumber; thisNumber <= maxNumber; thisNumber++)
{
int occurenceOfThisNumber = 0;
for(int i = 0; i < iterations; i++)
{
int result = getNumberFromSeed(i);
if(result == thisNumber)
{
occurenceOfThisNumber++;
}
}
if(occurenceOfThisNumber > occurencesOfMostFrequentNumber)
{
occurencesOfMostFrequentNumber = occurenceOfThisNumber;
mostFrequentNumber = thisNumber;
}
}
I must admit, this may take a long time, depending on the smallest and largest possible. But it will work without using arrays.