Mersenne Primes processing - sql

I took interest in Mersenne Primes https://www.mersenne.org/.
The Great Internet Mersenne Prime Search (GIMPS) is doing the research in this field.
These are prime numbers, but they are very large and few.
The 49th Mersenne prime is 22 million digits long. It is unbelievable that one number can be 22 million digits long.
I tried and could catch up to the 8th Mersenne prime, which is 10 digits long and just over 2 billion.
I am using Postgres BIGINT, which supports integers up to 19 digits long, roughly 9.2 quintillion (9.2 x 10^18).
So, if I am processing 1 billion rows at a time, it would take me about 9 billion iterations to cover that range.
I can further use the NUMERIC data type, which supports up to 131072 digits before the decimal point and 16383 digits after it. Of course I need to work with integers only; I do not need the fractional part.
Another alternative is Postgres's CHAR VARYING, which stores up to about a billion characters, but it cannot be used for calculations.
What Postgres provides is enough for any practical needs.
My question is how the guys at GIMPS are calculating with such large numbers.
Are they storing these numbers in any database? Which database supports such large numbers?
Am I out of sync with the progress made in the database world?
I know they have huge processing power; Curtis Cooper has mentioned that 700 servers are being used to discover and verify the numbers.
Exactly how much storage is needed? What language is being used?
Just curiosity. Does this sound like I am out of a job?
thanks
bb23850

Mersenne numbers are very easy to calculate. They are always one less than a power of 2:
select n, cast(power(cast(2 as numeric), n) - 1 as numeric(1000,0))
from generate_series(1, 100, 1) gs(n)
order by n;
The challenge is determining whether or not the resulting number is prime. Mersenne knew that n needs to be prime for the corresponding Mersenne number to be prime.
As fast as computers are, once the number has more than a dozen or two digits, an exhaustive search of all factors is not feasible. You can see from the above code that an exhaustive search becomes infeasible long before the 100th Mersenne number.
In order to determine if such a number is prime, a lot of mathematics is used -- some of it invented for or inspired by this particular problem. I'm pretty sure that it would be quite hard to implement any of those primality tests in a relational database.
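For context, the test actually used for Mersenne numbers, the Lucas-Lehmer test, is short enough to sketch outside the database. Here is a rough Python version; Python's arbitrary-precision integers stand in for the heavily optimized FFT arithmetic GIMPS really uses, so this is an illustration, not their code:

def lucas_lehmer(p):
    # Lucas-Lehmer test: M_p = 2**p - 1 is prime iff s == 0 after p - 2 squarings,
    # starting from s = 4 and reducing mod M_p each step. Valid for odd prime p.
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Exponents of known Mersenne primes pass; 11 and 23 (composite M_p) do not.
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(p)])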

Related

Solitaire: storing guaranteed wins cheaply

Given a list of deals of Klondike Solitaire that are known to win, is there a way to store a reasonable amount of deals (say 10,000+) in a reasonable amount of space (say 5MB) to retrieve on command? (These numbers are arbitrary)
I thought of using a pseudo random generator where a given seed would generate a decimal string of numbers, where each two digits represents a card, and the index represents the location of the deal. In this case, you would only have to store the seed and the PRG code.
The only cons I can think of would be that A) the number of possible deals is 52!, so the number of possible seeds would be at least 52!, which would be monstrous to store in the higher number range, and B) the generated number can't repeat a two-digit number (though repeats can be ignored in the deck construction).
Given no prior information, the theoretical limit on how compactly you can represent an ordered deck of cards is 226 bits. Even the simple, naive 6-bits-per-card encoding is only 312 bits, so you probably won't gain much by being clever.
If you're willing to sacrifice a large part of the state-space, you could use a 32- or 64-bit PRNG to generate the decks, and then you could reproduce them from the 32- or 64-bit initial PRNG state. But that limits you to 2^64 different decks out of the possible 2^225+.
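A rough sketch of that idea in Python (random.Random is the built-in Mersenne Twister; the seed values below are hypothetical placeholders, not deals known to be winnable):

import random

def deck_from_seed(seed):
    # Rebuild a deal from a stored integer seed; cards are numbered 0..51.
    rng = random.Random(seed)
    deck = list(range(52))
    rng.shuffle(deck)
    return deck

# Storing 10,000 deals then means storing 10,000 seeds (tens of KB), at the cost
# of only ever reaching the deals this particular PRNG can produce.
winnable_seeds = [17, 4242, 99031]      # hypothetical seeds, assumed winnable for illustration
first_deal = deck_from_seed(winnable_seeds[0])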
If you are asking hypothetically, I would say that you would need about 390 KB to store 10,000 possible deals. You need 6 bits to represent each card (assuming you number them 1-52) and then you would need to order them, so 6 * 52 = 312 bits per deal. Take that and multiply it by the number of deals, 312 * 10,000, and you get 3,120,000 bits, or about 390 KB.

Find first one million prime numbers

I want to get the first one million prime numbers.
I know the way of finding small prime numbers. My problem is, how can I store such large numbers in simple data types such as long, int, etc?
Well, the millionth prime is less than 16 million, and with the amount of memory in today's computers an ordinary C array of 16 million booleans (you can use 1 byte for each) isn't that large...
So allocate your large array, fill it with trues, treat the first element as representing the integer 2 (i.e. index + 2 is the represented value), and implement the skip-n/set-false version of the standard sieve. Count the trues as you go, and when you get to 1 million you can stop.
There are other ways, but this has the merit of being simple.
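A minimal Python sketch of that sieve-and-count idea (indexing the array directly by the integer value rather than by index + 2):

def first_million_primes(limit=16_000_000, count=1_000_000):
    # Sieve of Eratosthenes: is_prime[i] says whether the integer i is still marked prime.
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    primes = []
    for i in range(2, limit):
        if is_prime[i]:
            primes.append(i)
            if len(primes) == count:
                break
            # Knock out all multiples of i starting at i*i.
            is_prime[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return primes

primes = first_million_primes()
print(len(primes), primes[-1])    # 1000000 15485863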
You can allocate an array of 1000000 integers - it is only four megabytes, a small number by today's standards. Prime #1000000 should fit in a 32-bit integer (prime #500000 is under 8000000, so 2000000000 should be more than enough of a range for the first 1000000 primes).
You are more likely to encounter issues with the time, not with the space for your computation. Remember that you can stop testing candidate divisors when you reach the square root of the candidate prime, and that you can use the primes that you found so far as your candidate divisors.
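A sketch of that trial-division approach in Python, reusing the primes found so far and stopping at the square root; as the answer warns, this is noticeably slower than the sieve above:

def primes_by_trial_division(count=1_000_000):
    # Test each odd candidate against the primes found so far, up to sqrt(candidate).
    primes = [2]
    candidate = 3
    while len(primes) < count:
        for p in primes:
            if p * p > candidate:      # no divisor up to the square root: candidate is prime
                primes.append(candidate)
                break
            if candidate % p == 0:     # composite, move on
                break
        candidate += 2                 # even numbers > 2 are never prime
    return primes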

Print a number in decimal

Well, it is a low-level question.
Suppose I store a number (of course the computer stores the number in binary format).
How can I print it in decimal format? It is obvious in a high-level program: just print it and the library does it for you.
But how about in a very low-level situation where I don't have this library?
I can only tell which 'character' to output. How do I convert the number into decimal characters?
I hope you understand my question. Thank you.
There are two ways of printing decimals: one for CPUs with division/remainder instructions (modern CPUs are like that), and one for CPUs where division is relatively slow (8-bit CPUs of 20+ years ago).
The first method is simple: int-divide the number by ten, and store the sequence of remainders in an array. Once you have divided the number all the way down to zero, start printing the remainders from the last one, adding the ASCII code of zero ('0') to each remainder.
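A Python sketch of that first method (assuming only integer divide/modulo and the ability to emit one character at a time):

def print_decimal(n):
    # Repeatedly divide by 10, collect the remainders, then emit them in reverse order.
    if n == 0:
        print("0", end="")
        return
    digits = []
    while n > 0:
        digits.append(n % 10)
        n //= 10
    for d in reversed(digits):
        print(chr(ord("0") + d), end="")   # add the code of '0' to each remainder

print_decimal(1234)   # prints 1234
print()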
The second method relies on the lookup table of powers of ten. You define an array of numbers like this:
int pow10[] = {10000, 1000, 100, 10, 1};
Then you start with the largest power, and see if you can subtract it from the number at hand. If you can, keep subtracting it, and keep the count. Once you cannot subtract it without going negative, print the count plus the ASCII code of zero, and move on to the next smaller power of ten.
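And a sketch of that division-free method in Python, with the lookup table of powers of ten:

POW10 = [10000, 1000, 100, 10, 1]      # extend this table for the largest value you expect

def print_decimal_no_divide(n):
    started = False
    for p in POW10:
        count = 0
        while n >= p:                  # repeated subtraction stands in for the divide
            n -= p
            count += 1
        if count or started or p == 1: # suppress leading zeros, always print the ones digit
            print(chr(ord("0") + count), end="")
            started = True

print_decimal_no_divide(1234)   # prints 1234
print()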
If it is an integer, divide by ten and get both the quotient and the remainder. Repeat the process on the quotient until it reaches zero. The remainders will give you the decimal digits from right to left. Add 48 for the ASCII representation.
Basically, you want to transform a number (stored in some arbitrary internal representation) into its decimal representation. You can do this with a few simple mathematical operations. Let's assume that we have a positive number, say 1234.
number mod 10 gives you a value between 0 and 9 (4 in our example), which you can map to a character¹. This is the rightmost digit.
Divide by 10, discarding the remainder (an operation commonly called "integer division"): 1234 → 123.
number mod 10 now yields 3, the second-to-rightmost digit.
continue until number is zero.
Footnotes:
¹ This can be done with a simple switch statement with 10 cases. Of course, if your character set has the characters 0..9 in consecutive order (like ASCII), '0' + number suffices.
It doesn't matter what the number system is: decimal, binary, octal. Say I have the decimal value 123 on a decimal computer; I would still need to convert that value to three characters to display it. Let's assume ASCII format. By looking at an ASCII table we know the answer we are looking for: 0x31, 0x32, 0x33.
If you divide 123 by 10 using integer math you get 12. Multiply 12 * 10 and you get 120; the difference is 3, your least significant digit. We go back to the 12 and divide that by 10, giving 1. 1 times 10 is 10, and 12 - 10 is 2, our next digit. We take the 1 that is left over, divide it by 10, and get zero, so we know we are done. The digits we found, in order, are 3, 2, 1. Reverse the order: 1, 2, 3. Add or OR 0x30 to each to convert them from integers to ASCII.
Change that to use a variable instead of 123, and use any numbering system you like, so long as it has enough digits to do this kind of work.
You can go the other way too: divide by 100...000 (whatever the largest decimal you can store or intend to find) and work your way down. In this case the first non-zero result comes with the divide by 100, giving 1. Save the 1. 1 times 100 = 100, and 123 - 100 = 23. Now divide by 10; this gives 2. Save the 2. 2 times 10 is 20, and 23 - 20 = 3. When you get to the divide by 1, you are done; save that value as your ones digit.
Here is another example: given a number of seconds to convert to, say, hours, minutes and seconds, you can divide by 60 and save the result as a; subtract a*60 from the original number, giving the remainder, which is seconds; save that. Now take a and divide by 60, and save that as b; this is your number of hours. Subtract b*60 from a; this remainder is minutes; save that. Done: hours, minutes, seconds. You can then divide the hours by 24 to get days if you want, and then divide the days by 7 if you want weeks.
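That seconds example as a quick Python sketch, using divmod for the divide-and-remainder step:

def hms(total_seconds):
    # Same divide-and-remainder idea, just in base 60 instead of base 10.
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds

print(hms(3661))   # (1, 1, 1)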
A comment about divide instructions was brought up. Divides are very expensive, and most processors do not have one. Expensive in the sense that a divide done in a single clock costs you gates and power. If you do the divide over many clocks you might as well just do a software divide and save the gates. For the same reason most processors don't have an FPU: gates and power. (Gates mean larger chips, more expensive chips, lower yield, etc.) It is not a case of modern vs. old or 64-bit vs. 8-bit or anything like that; it is an engineering and business trade-off. The 8088/86 has a divide with a remainder, for example (it also has a BCD add). The gates/size, if used for something else, might serve you better than a single instruction does. Multiply falls into the same category; not as bad, but it can be. If operand sizes are not done right, you can make either instruction (family) not as useful to a programmer. Which brings up another point: I can't find the link right now, but a way to avoid divides when converting a number to a string of decimal digits is to multiply by .1 using fixed point. I also can't find the quote about real programmers not needing floating point, related to keeping track of the decimal point yourself; it's the slide rule vs. calculator thing. I believe the link to the article on dividing by 10 using a multiply is somewhere on Stack Overflow.

How many iterations of Rabin-Miller should I use for cryptographic safe primes?

I am generating a 2048-bit safe prime for a Diffie-Hellman-type key, p such that p and (p-1)/2 are both prime.
How few iterations of Rabin-Miller can I use on both p and (p-1)/2 and still be confident of a cryptographically strong key? In the research I've done I've heard everything from 6 to 64 iterations for 1024-bit ordinary primes, so I'm a little confused at this point. And once that's established, does the number change if you are generating a safe prime rather than an ordinary one?
Computation time is at a premium, so this is a practical question - I'm basically wondering how to find out the lowest possible number of tests I can get away with while at the same time maintaining pretty much guaranteed security.
Let's assume that you select a prime p by selecting random values until you hit one for which Miller-Rabin says: that one looks like a prime. You use n rounds at most for the Miller-Rabin test. (For a so-called "safe prime", things are not changed, except that you run two nested tests.)
The probability that a random 1024-bit integer is prime is about 1/900. Now, you do not want to do anything stupid so you generate only odd values (an even 1024-bit integer is guaranteed non-prime), and, more generally, you run the Miller-Rabin test only if the value is not "obviously" non-prime, i.e. can be divided by a small prime. So you end up with trying about 300 values with Miller-Rabin before hitting a prime (on average). When the value is non-prime, Miller-Rabin will detect it with probability 3/4 at each round, so the number of Miller-Rabin rounds you will run on average for a single non-prime value is 1+(1/4)+(1/16)+... = 4/3. For the 300 values, this means about 400 rounds of Miller-Rabin, regardless of what you choose for n.
So if you select n to be, e.g., 40, then the cost implied by n is less than 10% of the total computational cost. The random prime selection process is dominated by the test on non-primes, which are not impacted by the value of n you choose. I talked here about 1024-bit integers; for bigger numbers the choice of n is even less important since primes become sparser as size increases (for 2048-bit integers, the "10%" above become "5%").
Hence you can choose n=40 and be happy with it (or at least know that reducing n will not buy you much anyway). On the other hand, using an n greater than 40 is meaningless, because this would get you to probabilities lower than the risk of a simple miscomputation. Computers are hardware; they can have random failures. For instance, a primality test function could return "true" for a non-prime value because a cosmic ray (a high-energy particle hurtling through the Universe at high speed) happens to hit just the right transistor at the right time, flipping the return value from 0 ("false") to 1 ("true"). This is very unlikely, but no less likely than probability 2^(-80). See this stackoverflow answer for a few more details. The bottom line is that regardless of how you make sure that an integer is prime, you still have an unavoidable probabilistic element, and 40 rounds of Miller-Rabin already give you the best that you can hope for.
To sum up, use 40 rounds.
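A Python sketch of that selection loop (odd candidates, a cheap small-prime filter, then n rounds of Miller-Rabin; pow does the modular exponentiation, and this is an illustration of the process described above rather than a library-grade implementation):

import random

SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

def miller_rabin(n, rounds=40):
    # Standard Miller-Rabin probable-prime test; assumes n is an odd integer > 3.
    d, r = n - 1, 0
    while d % 2 == 0:                 # write n - 1 as d * 2**r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False              # a is a witness: n is definitely composite
    return True                       # probably prime

def random_prime(bits=1024, rounds=40):
    # Draw random odd candidates with the top bit set, discard the ones the cheap
    # small-prime filter catches, and run the full test only on the survivors.
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if any(candidate % p == 0 for p in SMALL_PRIMES):
            continue
        if miller_rabin(candidate, rounds):
            return candidate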
The paper Average case error estimates for the strong probable prime test by Damgard-Landrock-Pomerance points out that, if you randomly select a k-bit odd number n and apply t independent Rabin-Miller tests in succession, the probability that n is composite has a much stronger bound.
In fact, for 3 <= t <= k/9 and k >= 21, that probability is at most k^(3/2) * 2^t * t^(-1/2) * 4^(2 - sqrt(t*k)).
For a k=1024 bit prime, t=6 iterations give you an error rate less than 10^(-40).
Each iteration of Rabin-Miller cuts the odds that the number is composite to at most 1/4 of what they were.
So after 64 iterations, there is only 1 chance in 2^128 that the number is composite.
Assuming you are using these for a public key algorithm (e.g. RSA), and assuming you are combining that with a symmetric algorithm using (say) 128-bit keys, an adversary can guess your key with that probability.
The bottom line is to choose the number of iterations to put that probability within the ballpark of the other sizes you are choosing for your algorithm.
[update, to elaborate]
The answer depends entirely on what algorithms you are going to use the numbers for, and what the best known attacks are against those algorithms.
For example, according to Wikipedia:
As of 2003 RSA Security claims that 1024-bit RSA keys are equivalent in strength to 80-bit symmetric keys, 2048-bit RSA keys to 112-bit symmetric keys and 3072-bit RSA keys to 128-bit symmetric keys.
So, if you are planning to use these primes to generate (say) a 1024-bit RSA key, then there is no reason to run more than 40 iterations or so of Rabin-Miller. Why? Because by the time you hit a failure, an attacker could crack one of your keys anyway.
Of course, there is no reason not to perform more iterations, time permitting. There just isn't much point to doing so.
On the other hand, if you are generating 2048-bit RSA keys, then 56 (or so) iterations of Rabin-Miller is more appropriate.
Cryptography is typically built as a composition of primitives, like prime generation, RSA, SHA-2, and AES. If you want to make one of those primitives 2^900 times stronger than the others, you can, but it is a little like putting a 10-foot steel vault door on a log cabin.
There is no fixed answer to your question. It depends on the strength of the other pieces going into your cryptographic system.
All that said, 2^-128 is a ludicrously tiny probability, so I would probably just use 64 iterations :-).
From the libgcrypt source:
/* We use 64 Rabin-Miller rounds which is better and thus
sufficient. We do not have a Lucas test implementaion thus we
can't do it in the X9.31 preferred way of running a few
Rabin-Miller followed by one Lucas test. */
cipher/primegen.c line# 1295
I would run two or three iterations of Miller-Rabin (i.e., strong Fermat probable prime) tests, making sure that one of the bases is 2.
Then I would run a strong Lucas probable prime test, choosing D, P, and Q with the method described here:
https://en.wikipedia.org/wiki/Baillie%E2%80%93PSW_primality_test
There are no known composites that pass this combination of Fermat and Lucas tests.
This is much faster than doing 40 Rabin-Miller iterations. In addition, as was pointed out by Pomerance, Selfridge, and Wagstaff in https://math.dartmouth.edu/~carlp/PDF/paper25.pdf, there are diminishing returns with multiple Fermat tests: if N is a pseudoprime to one base, then it is more likely than the average number to be a pseudoprime to other bases. That's why, for example, we see so many psp's base 2 are also psp's base 3.
A smaller probability is usually better, but I would take the actual probability value with a grain of salt. Albrecht et al., in Prime and Prejudice: Primality Testing Under Adversarial Conditions, break a number of prime-testing routines in cryptographic libraries. In one example, the published probability is 1/2^80, but the number they construct is declared prime 1 time out of 16.
In several other examples, their number passes 100% of the time.
Only 2 iterations, assuming 2^(-80) as a negligible probability.
From (Alfred J. Menezes et al. 1996), §4.4, p. 148.
Does it matter? Why not run for 1000 iterations? When searching for primes, you stop applying the Rabin-Miller test anyway the first time it fails, so for the time it takes to find a prime it doesn't really matter what the upper bound on the number of iterations is. You could even run a deterministic primality checking algorithm after those 1000 iterations to be completely sure.
That said, the probability that a composite number survives n iterations is at most 4^(-n).

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (i.e. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n, is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/(e * k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right? And how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture; it makes me wish I had taken a formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
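A Python sketch of that empirical test, using Python's built-in hash as a stand-in for the hash function under test and printing the ln(N)/ln(ln(N)) reference value alongside the observed average:

import math
import random
from collections import Counter

def max_load(n):
    # Hash n random string keys into n bins and return the size of the fullest bin.
    keys = (str(random.random()) for _ in range(n))
    bins = Counter(hash(k) % n for k in keys)
    return max(bins.values())

for n in (100, 1_000, 10_000, 100_000):
    runs = [max_load(n) for _ in range(30)]           # 30+ runs, as suggested above
    print(n, sum(runs) / len(runs), math.log(n) / math.log(math.log(n)))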
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you can evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons, or some similar situation, will fight against your hash function. And proper contingency planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same, i.e. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values,
Pr[ exactly k ] = 1/(e * k!)
then accumulating the probabilities of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
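The same accumulation as a small Python sketch (Pr[exactly k] = 1/(e * k!) is the Poisson distribution with mean 1, which is what applies when n items go into n slots):

import math

def expected_max_load(n, confidence=None):
    # Smallest k whose cumulative Poisson(1) probability, the sum of 1/(e * i!) for i = 0..k,
    # reaches the chosen confidence (1 - 1/n by default).
    if confidence is None:
        confidence = 1 - 1 / n
    total, k = 0.0, 0
    while total < confidence:
        total += 1 / (math.e * math.factorial(k))
        k += 1
    return k - 1

print(expected_max_load(1000))   # 5, matching the table above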
So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places each hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions, but it means that adding to and searching the hash table is a little slower.
I hope this helps.