Find first one million prime numbers - objective-c

I want to get the first one million prime numbers.
I know the way of finding small prime numbers. My problem is, how can I store such large numbers in simple data types such as long, int, etc?

Well the millionth prime is less than 16 million, and with the amount of memory in today's computers an ordinary C array of 16 million booleans (you can use 1 byte for each) isn't that large...
So allocate your large array, fill it with true's, treat the first element as representing the integer 2 (i.e. index + 2 is the represented value), and implement the skip n/set false version of the standard sieve. Count the true's as you go and when you get to 1 million you can stop.
There are others ways, but this has the merit of being simple.

You can allocate an array of 1000000 integers - it is only four megabytes, a small number by today's standards. Prime #1000000 should fit in a 32-bit integer (prime #500000 is under 8000000, so 2000000000 should be more than enough of a range for the first 1000000 primes).
You are more likely to encounter issues with the time, not with the space for your computation. Remember that you can stop testing candidate divisors when you reach the square root of the candidate prime, and that you can use the primes that you found so far as your candidate divisors.

Related

Mersenne Primes processing

I took interest in Mersenne Primes https://www.mersenne.org/.
Great Internet Mersenne Prime Search (GIMPS) is doing the research in this field.
These are Prime Numbers but are very large and few.
49th Mersenne Prime is 22 million digits long. It is unbelievable that one number can be 22 million digits.
I tried and could catch up to 8th Mersenne Prime which is 10 digits long and within 2 billions.
I am using Postgres BIGINT which supports up to 19 digit long integers which 9 million billions.
So, if I am processing 1 billion rows at a time, it would take me 9 million iterations.
I can further use NUMERIC data type which supports 131072 digits to left of decimal and a precision 16383 digits. Of course I need to work with integers only. I do not need precision.
Another alternative is Postgres's CHAR VARYING which stores up to a billion. But it can not be used for calculations.
What Postgres provides is enough for any practical needs.
My question is how the guys at GIMPS are calculating such large numbers.
Are they storing these numbers in any database. Which database supports such large numbers.
Am I out of sync with progresses made in database world.
I know they have huge processing power Curtis Cooper has mentioned 700 servers are being used to discover and verify the numbers.
Exactly how much storage it is needed. What language is being used.
Just curiosity. Does this sound like I am out of job.
thanks
bb23850
Mersenne numbers are very easy to calculate. They are always one less than a power of 2:
select n, cast(power(cast(2 as numeric), n) - 1 as numeric(1000,0))
from generate_series(1, 100, 1) gs(n)
order by n;
The challenge is determining whether or not the resulting number is a prime. Mersenne knew that n needs to be prime for the number corresponding Mersenne number to be prime.
As fast a computers are, once the number has more than a dozen or two dozen or so digits, an exhaustive search of all factors is not feasible. You can see from the above code that an exhaustive search becomes infeasible long before the 100th Mersenne number.
In order to determine if such a number is prime, a lot of mathematics is used -- some of it invented for or inspired by this particular problem. I'm pretty sure that it would be quite hard to implement any of those primality tests in a relational database.

Solitaire: storing guaranteed wins cheaply

Given a list of deals of Klondike Solitaire that are known to win, is there a way to store a reasonable amount of deals (say 10,000+) in a reasonable amount of space (say 5MB) to retrieve on command? (These numbers are arbitrary)
I thought of using a pseudo random generator where a given seed would generate a decimal string of numbers, where each two digits represents a card, and the index represents the location of the deal. In this case, you would only have to store the seed and the PRG code.
The only cons I can think of would be that A) the number of possible deals is 52!, and so the number of possible seeds would be at least 52!, and would be monstrous to store in the higher number range, and B) the generated number can't repeat a two digit number (though they can be ignored in the deck construction)
Given no prior information, the theoretical limit on how compactly you can represent an ordered deck of cards is 226 bits. Even the simple naive 6-bits-per card is only 312 bits, so you probably won't gain much by being clever.
If you're willing to sacrifice a large part of the state-space, you could use a 32- or 64-bit PRNG to generate the decks, and then you could reproduce them from the 32- or 64-bit initial PRNG state. But that limits you to 2^64 different decks out of the possible 2^225+.
If you are asking hypothetically, I would say that you would need at least 3.12 MB to store 10,000 possible deals. You need 6 bits to represent each card (assuming you number them 1-52) and then you would need to order them so 6 * 52 = 312. Take that and multiply it by the number of deals 312 * 10,000 and you get 3,120,000 bits or 3.12 MB.

Where does the number 2 exp 32 come from?

A REDIS instance can store 2exp32 keys. A REIS set can store 2exp32 entries. Where does this number come from?
232 − 1 is the largest number representable by an unsigned 32-bit integer. For many years, 32 bits was the most common size of a CPU integer register, so that upper limit is baked into programs all over the place. Nowadays CPUs with 64-bit-wide integer registers are increasingly common, but programmers still often reach for the 32-bit integer first.
2^32 = 4,294,967,296, which is the value you can fit in a 32-bit data type. Sets of 250 million keys have been tested.
More information on their page
2^32 is the number of possible values that can be represented by an 4 byte integer.
2^32 is a 32-bit number. 32-bit is a common size for many database systems.

Print a number in decimal

Well, it is a low-level question
Suppose I store a number (of course computer store number in binary format)
How can I print it in decimal format. It is obvious in high-level program, just print it and the library does it for you.
But how about in a very low-level situation where I don't have this library.
I can just tell what 'character' to output. How to convert the number into decimal characters?
I hope you understand my question. Thank You.
There are two ways of printing decimals - on CPUs with division/remainder instructions (modern CPUs are like that) and on CPUs where division is relatively slow (8-bit CPUs of 20+ years ago).
The first method is simple: int-divide the number by ten, and store the sequence of remainders in an array. Once you divided the number all the way to zero, start printing remainders starting from the back, adding the ASCII code of zero ('0') to each remainder.
The second method relies on the lookup table of powers of ten. You define an array of numbers like this:
int pow10 = {10000,1000,100,10,1}
Then you start with the largest power, and see if you can subtract it from the number at hand. If you can, keep subtracting it, and keep the count. Once you cannot subtract it without going negative, print the count plus the ASCII code of zero, and move on to the next smaller power of ten.
If integer, divide by ten, get both the result and the remainder. Repeat the process on the result until zero. The remainders will give you decimal digits from right to left. Add 48 for ASCII representation.
Basically, you want to tranform a number (stored in some arbitrary internal representation) into its decimal representation. You can do this with a few simple mathematical operations. Let's assume that we have a positive number, say 1234.
number mod 10 gives you a value between 0 and 9 (4 in our example), which you can map to a character¹. This is the rightmost digit.
Divide by 10, discarding the remainder (an operation commonly called "integer division"): 1234 → 123.
number mod 10 now yields 3, the second-to-rightmost digit.
continue until number is zero.
Footnotes:
¹ This can be done with a simple switch statement with 10 cases. Of course, if your character set has the characters 0..9 in consecutive order (like ASCII), '0' + number suffices.
It doesnt matter what the number system is, decimal, binary, octal. Say I have the decimal value 123 on a decimal computer, I would still need to convert that value to three characters to display them. Lets assume ASCII format. By looking at an ASCII table we know the answer we are looking for, 0x31,0x32,0x33.
If you divide 123 by 10 using integer math you get 12. Multiply 12*10 you get 120, the difference is 3, your least significant digit. we go back to the 12 and divide that by 10, giving a 1. 1 times 10 is 10, 12-10 is 2 our next digit. we take the 1 that is left over divide by 10 and get zero we know we are now done. the digits we found in order are 3, 2, 1. reverse the order 1, 2, 3. Add or OR 0x30 to each to convert them from integers to ascii.
change that to use a variable instead of 123 and use any numbering system you like so long as it has enough digits to do this kind of work
You can go the other way too, divide by 100...000, whatever the largest decimal you can store or intend to find, and work your way down. In this case the first non zero comes with a divide by 100 giving a 1. save the 1. 1 times 100 = 100, 123-100 = 23. now divide by 10, this gives a 2, save the 2, 2 times 10 is 20. 23 - 20 = 3. when you get to divide by 1 you are done save that value as your ones digit.
here is another given a number of seconds to convert to say hours and minutes and seconds, you can divide by 60, save the result a, subtract the original number - (a*60) giving your remainder which is seconds, save that. now take a and divide by 60, save that as b, this is your number of hours. subtract a - (b*60) this is the remainder which is minutes save that. done hours, minutes seconds. you can then divide the hours by 24 to get days if you want and days and then that by 7 if you want weeks.
A comment about divide instructions was brought up. Divides are very expensive and most processors do not have one. Expensive in that the divide, in a single clock, costs you gates and power. If you do the divide in many clocks you might as well just do a software divide and save the gates. Same reason most processors dont have an fpu, gates and power. (gates mean larger chips, more expensive chips, lower yield, etc). It is not a case of modern or old or 64 bit vs 8 bit or anything like that it is an engineering and business trade off. the 8088/86 has a divide with a remainder for example (it also has a bcd add). The gates/size if used might be better served than for a single instruction. Multiply falls into that category, not as bad but can be. If operand sizes are not done right you can make either instruction (family) not as useful to a programmer. Which brings up another point, I cant find the link right now but a way to avoid divides but convert from a number to a string of decimal digits is that you can multiply by .1 using fixed point. I also cant find the quote about real programmers not needing floating point related to keeping track of the decimal point yourself. its the slide rule vs calculator thing. I believe the link to the article on dividing by 10 using a multiply is somewhere on stack overflow.

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/ek! (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or 2 and is this max-load formula right and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = ((e/k)**k)/e
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
So if my hash function is working well on my set of data then I should expect the maxmium number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this. For example you could use multiple hash functions, but slows down hashing and I don't know much about this).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.