How to find factors of a very big number [closed] - bignum

I need to find the factors of a very big number, say 10^1000. For example, if the input is 100 then the output should be 10 10, because 10 * 10 = 100. This is very simple if N fits in a long, but I want to know how it is possible to find the factors of a very big number such as 10^1000. Also, I can't use BigInteger.

1) As has been pointed out, factoring large numbers is hard. It is in fact sufficiently hard that it's the basis of RSA public-key cryptography; in other words, every time you buy something online you are counting on the fact that it's hard to factor numbers on the order of 2^2048 (since 2^10 = 1024 is about 10^3, 2^2048 is about 10^600). While RSA specifically uses two large prime numbers, and your arbitrary N may have lots of small factors which will help somewhat, I wouldn't count on being able to factor 10^1000 +/- some random value anytime soon.
2) You can definitely reimplement a big-number library using strings [source: I had a classmate who did it before we learned how to do big-number math], but it's going to be painfully slow, because you basically have to convert your strings back to ints each time. A slightly less painful approach, if you want to reimplement big numbers, is arrays of integers (a rough sketch of this follows after this list). You still need to do some extra steps, but for at least basic math it's not super difficult. (It still won't be as efficient as specialised big-number libraries, which use clever algorithms. For example, multiplying two big numbers the straightforward way: let A = P * 2^32 + Q (i.e. A is a 64-bit number represented as an array of two 32-bit numbers) and B = R * 2^32 + S; the straightforward approach takes 4 multiplications plus some additions plus some dealing with carries. As the size of the big numbers increases, there are ways, e.g. http://en.wikipedia.org/wiki/Karatsuba_algorithm, to reduce the number of multiplications required.)
3) (There are algorithms that factor numbers more efficiently than trial division, but the current ones are still not going to factor the numbers you're asking about before the heat death of the universe.)
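To make point 2 concrete, here is a rough sketch of the arrays-of-integers approach in Python (the question doesn't name a language, and all helper names here are made up): the number is held as a list of base-2^32 limbs and multiplied schoolbook-style with explicit carry handling.

# Minimal sketch: big numbers as little-endian lists of base-2**32 "limbs".
BASE = 2 ** 32

def to_limbs(n):
    """Split a non-negative int into base-2**32 limbs, least significant first."""
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]

def from_limbs(limbs):
    """Recombine limbs into a Python int (handy for checking the result)."""
    return sum(d * BASE ** i for i, d in enumerate(limbs))

def multiply(a, b):
    """Schoolbook O(len(a)*len(b)) multiplication of two limb arrays."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            total = result[i + j] + x * y + carry
            result[i + j] = total % BASE
            carry = total // BASE
        result[i + len(b)] += carry
    return result

# Quick check against Python's built-in arbitrary-precision integers.
a, b = 10 ** 50 + 12345, 2 ** 100 + 6789
assert from_limbs(multiply(to_limbs(a), to_limbs(b))) == a * b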

10^1000 has exactly 1,002,001 integer divisors, and they should be very easy to find with a bit of thinking. The prime factorisation is
2 * 2 * 2 * ... * 5 * 5 * 5
with exactly 1,000 twos and exactly 1,000 fives.
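The reasoning, spelled out: every divisor of 10^1000 = 2^1000 * 5^1000 has the form 2^a * 5^b with 0 <= a, b <= 1000, which gives 1001 * 1001 = 1,002,001 divisors. A tiny Python sketch (names are my own) that counts and lazily generates them:

# Every divisor of 10**1000 is 2**a * 5**b with 0 <= a, b <= 1000.
print(1001 * 1001)  # 1002001

def divisors_of_10_pow_1000():
    """Lazily yield all 1,002,001 divisors (Python ints are arbitrary precision)."""
    for a in range(1001):
        for b in range(1001):
            yield 2 ** a * 5 ** b

# sum(1 for _ in divisors_of_10_pow_1000()) == 1002001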

Related

Mersenne Primes processing

I took interest in Mersenne Primes https://www.mersenne.org/.
Great Internet Mersenne Prime Search (GIMPS) is doing the research in this field.
These are prime numbers, but they are very large and few.
The 49th Mersenne prime is 22 million digits long. It is unbelievable that one number can have 22 million digits.
I tried and could only catch up to the 8th Mersenne prime, which is 10 digits long and under 2 billion.
I am using Postgres BIGINT, which supports integers up to 19 digits long (about 9.2 quintillion).
So, if I am processing 1 billion rows at a time, it would take me about 9 billion iterations.
I could further use the NUMERIC data type, which supports up to 131072 digits to the left of the decimal point and up to 16383 digits to the right. Of course, I need to work with integers only; I do not need fractional precision.
Another alternative is Postgres's CHARACTER VARYING, which stores up to about a billion characters, but it cannot be used for calculations.
What Postgres provides is enough for any practical need.
My question is how the people at GIMPS are calculating such large numbers.
Are they storing these numbers in a database? Which database supports such large numbers?
Am I out of sync with the progress made in the database world?
I know they have huge processing power; Curtis Cooper has mentioned that 700 servers are being used to discover and verify the numbers.
Exactly how much storage is needed? What language is being used?
Just curiosity. Does this sound like I am out of a job?
Thanks
bb23850
Mersenne numbers are very easy to calculate. They are always one less than a power of 2:
-- Generate the first 100 Mersenne numbers, 2^n - 1
select n, cast(power(cast(2 as numeric), n) - 1 as numeric(1000,0))
from generate_series(1, 100, 1) gs(n)
order by n;
The challenge is determining whether or not the resulting number is prime. Mersenne knew that n needs to be prime for the corresponding Mersenne number, 2^n - 1, to be prime.
As fast as computers are, once the number has more than a dozen or two dozen digits, an exhaustive search of all factors is not feasible. You can see from the above code that an exhaustive search becomes infeasible long before the 100th Mersenne number.
In order to determine if such a number is prime, a lot of mathematics is used -- some of it invented for or inspired by this particular problem. I'm pretty sure that it would be quite hard to implement any of those primality tests in a relational database.
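For what it's worth, the test GIMPS uses for Mersenne numbers is the Lucas-Lehmer test (combined with FFT-based big-number multiplication). It is not something you would run inside a relational database, but a minimal, unoptimised sketch in Python looks like this (helper names are my own):

def is_prime(n):
    """Trial division; fine for small exponents."""
    if n < 2:
        return False
    return all(n % k for k in range(2, int(n ** 0.5) + 1))

def is_mersenne_prime(p):
    """Lucas-Lehmer test: for an odd prime p, 2**p - 1 is prime iff s_(p-2) == 0."""
    if p == 2:
        return True  # 2**2 - 1 = 3; the loop below needs p > 2
    m = 2 ** p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Exponents of the first Mersenne primes: 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127
print([p for p in range(2, 128) if is_prime(p) and is_mersenne_prime(p)])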

What is the reason behind precision and scale naming? [closed]

I have a lot of trouble understanding why precision and scale are called what they are in database types.
I see precision and scale differently.
PRECISION FROM MY POINT OF VIEW
For me, precision would be how many digits there are to the right of the decimal point. For instance, 1 is less precise than 1.0001.
SCALE FROM MY POINT OF VIEW
Scale would be how far a number can go up or down.
For instance 0 - 1000 is a bigger scale than 0 - 10.
Or even 0 - 1.0 is a bigger scale than 0 - 1.
PRECISION AND SCALE IN DATABASES
However, in database lexicon these have different meanings: precision is the total number of digits in a number, and scale is the number of digits to the right of the decimal point.
I'm always forgetting the meaning of these two words because I can't make sense of them.
I hope you can help me understand why they are called this way.
Perhaps it's easier if you consider how the numbers look in scientific notation.
X * 10^Y
where X has a single digit before the decimal point.
Now, how big the number is (its "scale") is fundamentally determined by Y. Are we counting in ones? Millions? Thousandths? That's scale.
Regardless of the absolute scale of the number, the digits in X determine how precise we're being. Can I distinguish 1.1 ones from 1.2 ones? Can I distinguish 1.1 millions from 1.2 millions? Can I distinguish 1.1 thousandths from 1.2 thousandths? All are equivalent - two digits (including the one before the decimal point) of precision.
If I can distinguish 1.01 millions from 1.00 millions, that's more precise than only being able to distinguish 0.1 millions; that's 3 digits of precision.
But 1.01*10^-3 is not more precise than 1.01*10^10 ; it merely operates at a smaller scale.
Beyond that, I don't know what you want. Ok, you've told us what you'd like the words to mean; but that's not what they mean. This is what they mean.
UPDATE - One other thing I should mention. It may seem that scale and precision are conflated in some way, because if we take a physical example, surely "1 millimeter from the bullseye" is more precise than "1 meter from the bullseye", right?
But remember that precision and scale describe a variable's data type, not a specific measurement. If measuring in meters, we can't express "1 millimeter from the bullseye" with less than 4 digits of precision ("0.001 meters"); but we could describe "1 meter from the bullseye" with 1 digit of precision. So that actually does align with our desire to call "1 mm from the bullseye" somehow more precise.
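To tie this back to the database definitions in the question, here is a small Python illustration (a hypothetical helper, not database code): precision counts all of a value's digits, and scale counts only those to the right of the decimal point.

from decimal import Decimal

def precision_and_scale(literal):
    """Database-style precision and scale of a numeric literal:
    precision = total number of digits, scale = digits right of the decimal point."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    scale = max(-exponent, 0)
    precision = max(len(digits), scale)
    return precision, scale

print(precision_and_scale("123.45"))  # (5, 2)  -> would fit a NUMERIC(5, 2) column
print(precision_and_scale("0.001"))   # (3, 3)
print(precision_and_scale("12345"))   # (5, 0)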

Limiting chosen variables solved for in OpenSolver [closed]

I've got a linear system of 17 equations and 506 variables that solves for the minimum sum of the variables. This works fine so far, but the solution is a combination of 19 variables.
But in the end I want to limit the number of chosen variables to 10, without knowing in advance which ones are optimal (the solver figures that out for me, as well as their ratio).
I figured I could set a boolean to 1 if the value becomes larger than 0 (meaning the variable is picked), and 0 if the variable is not picked for the optimal solution,
and then require the sum of the booleans to be at most 10.
However, this seems a bit elaborate, and I was wondering whether there is a built-in option in OpenSolver, as I think it is quite a common problem to solve a large set with a limited subset.
So does anyone have a suggestion on:
Whether my elaborate way drastically decreases performance? (I have no intrinsic understanding of the OpenSolver algorithms yet.)
A way, more easily or within the OpenSolver options, to account for my requirement of at most 10 solution variables?
Based on the information provided below, I first scaled down the size of the problem:
I have three lists of data with 18 entries in columns:
W7:W23,AC7:AD23
which manually (with W28 = 6000, AC28 = 600, W29 = 1, AC29 = 1), in a linear combination, equal or exceed the target list:
EGM34:EGM50
So what I did was put the decision variables in W28:W29, AC28:AD29.
I added the constraint W28,AC28:AD28 = integer in the solver (both in the original Excel solver and in OpenSolver),
and I added the constraint W29,AC29:AD29 = boolean in the solver (both in the original Excel solver and in OpenSolver).
Then I have a multiplication integer * boolean = the actual multiplication factor for the above lists (W7:W23 etc.).
In order to limit the number of chosen variables I have also tried, in addition to the described constraints, to constrain the cell containing =SUM(W29,AC29:AD29) to <= 10 (effectively reducing the number of booleans set to true to below 11, or so I thought, but the booleans aren't evaluated as booleans by the solver).
These new multiplied lists are placed in W34:W50, AC34:AD50, and the summation is situated in EGY34:EGY50. Hence the final check is added as a constraint:
EGY34:EGY50 >= EGM34:EGM50
And I had a question about how the linear solver evaluates these constraints. Does it:
a. think the sum of EGY34:EGY50 must be larger than or equal to the sum of EGM34:EGM50, or
b. think "for every row x, EGYx must be larger than or equal to EGMx"?
So far I've observed b, but I would like to make sure.
But my main question concerns:
After using the Evolutionary algorithm, as was kindly suggested in the comments, how/why does it try values such as 0.99994 for the decision variables designated as booleans?
The introduction of binary variables is indeed the standard way to implement such constraints. Unfortunately, it transforms the problem from a linear programming problem into an integer programming problem (specifically a mixed-integer linear programming problem). A standard approach to such problems is the branch-and-bound algorithm. This is what Excel's built-in solver seems to use; I'm not sure about the OpenSolver that you are using. In the best case (where there is a lot of bounding) it will run fairly quickly, even with problems of your size. In the worst case, for your problem, it could be little better than running the simplex algorithm C(506,10) = 2.8 x 10^20 times (once for each possible set of 10 decision variables). In other words, it might be infeasible. Integer programming is known to be NP-hard.
If an exact solution is infeasible, you could always use a heuristic algorithm such as an evolutionary approach.
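For reference, here is what the standard binary-indicator formulation looks like, sketched in Python with the PuLP modelling library purely as an illustration (the question itself is about OpenSolver in Excel); the variable names, the big-M bound and the data are all assumptions:

# Sketch of the "limit the number of non-zero variables" formulation.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

n_vars = 506
big_m = 10_000  # an upper bound on any single x[i]; must be chosen for your data

prob = LpProblem("limited_selection", LpMinimize)
x = [LpVariable(f"x_{i}", lowBound=0) for i in range(n_vars)]    # continuous amounts
y = [LpVariable(f"y_{i}", cat=LpBinary) for i in range(n_vars)]  # 1 if x[i] is used

prob += lpSum(x)  # objective: minimise the total of the variables

# x[i] can only be positive when its indicator y[i] is switched on.
for i in range(n_vars):
    prob += x[i] <= big_m * y[i]

# At most 10 variables may be selected.
prob += lpSum(y) <= 10

# ... the 17 original equations/inequalities would be added here as further constraints.
prob.solve()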

Storing and computing with real numbers up to an arbitrary precision in VB.NET [duplicate]

Possible Duplicate:
.NET Framework Library for arbitrary digit precision
How can I store a real number, e.g. root 2 or one third, to arbitrary precision (the precision I need is infinite) in VB.NET?
I would like to be able to store real numbers and perform operations on them (e.g. root 2 times root 2) without losing any accuracy, i.e. storing 1/3 would return the value 1/3 if I needed to retrieve it.
I was thinking of using a fractal encoding but I am unsure as to the best way to do this.
Storage capacity is not an issue, I just need the real numbers to be 100% accurate.
Will that be a single real number, or does it need to be an arbitrary number of (almost) arbitrary figures? (Sorry for posting this as an "answer" - for some reason I can't add comments now...)
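For the "1/3 must come back as exactly 1/3" part, exact rational arithmetic is enough. The sketch below uses Python's fractions module purely to illustrate the idea (the question is about VB.NET, where you would build or find an equivalent rational type on top of BigInteger). Irrationals such as root 2 cannot be stored this way and need a symbolic representation instead.

from fractions import Fraction

one_third = Fraction(1, 3)
print(one_third)                         # 1/3, exactly -- no rounding ever happens
print(one_third * 3)                     # 1
print(Fraction(1, 3) + Fraction(1, 6))   # 1/2

# sqrt(2), however, is irrational: no fraction (or decimal with finitely many
# digits) holds it exactly. You either keep it symbolically, e.g. as a pair
# ("sqrt", 2), or settle for an approximation to some chosen precision.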

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem, and it seems that if a hash function is working right (i.e. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has exactly k balls is <= 1/(e * k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)^k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right? And how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, "with high probability" seems to mean with probability 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
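For what it's worth, here is a rough sketch of that kind of empirical test in Python; the choice of hash function (Python's built-in hash) and the random-string inputs are just placeholders for whatever you actually want to test:

import math
import random
import string
from collections import Counter

def max_load(n):
    """Hash n random 12-character strings into n bins; return the largest bin count."""
    bins = Counter()
    for _ in range(n):
        key = "".join(random.choices(string.ascii_lowercase, k=12))
        bins[hash(key) % n] += 1
    return max(bins.values())

for n in (100, 1_000, 10_000, 100_000):
    runs = [max_load(n) for _ in range(30)]
    predicted = math.log(n) / math.log(math.log(n))
    print(f"n={n:>6}  average max-load={sum(runs) / len(runs):.2f}  "
          f"ln(n)/ln(ln(n))={predicted:.2f}")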
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the ordered doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d), and the expected minimum door is exactly 0.9616 {(625-24) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the lowest door's, because all of the people could go through one door, but no fewer than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute values of the highest and lowest doors' distances from the mean becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons, or some similar situation, will fight against your hash function. And proper contingency planning is the work of all good programmers.
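If anyone wants to reproduce figures like the ones above without the exact brute-force computation, a Monte Carlo approximation is straightforward; this Python sketch (the parameters are just examples) averages the sorted per-door counts over many random sessions:

import random

def averaged_ordered_doors(p, d, sessions=1000):
    """Send p people through d doors uniformly at random, repeat over many
    sessions, and average the sorted per-door counts across the sessions."""
    totals = [0.0] * d
    for _ in range(sessions):
        counts = [0] * d
        for _ in range(p):
            counts[random.randrange(d)] += 1
        for i, c in enumerate(sorted(counts)):
            totals[i] += c
    return [round(t / sessions, 3) for t in totals]

# Example: 170 people and 5 doors, comparable to the exact values quoted above.
print(averaged_ordered_doors(170, 5))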
After some more research and trial and error I think I can provide something partway to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - i.e. collisions (assuming that your hash function is good).
Using the formula for the probability of a given slot getting exactly k identical hash values,
Pr[ exactly k ] = 1/(e * k!)
then accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
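The same accumulation can be scripted; this Python sketch uses the Pr[exactly k] = 1/(e * k!) approximation from above:

import math

def max_load_for(target=0.999):
    """Smallest k such that Pr[a bin holds at most k balls] >= target,
    using the Poisson(1) approximation Pr[exactly k] = 1 / (e * k!)."""
    cumulative = 0.0
    k = 0
    while True:
        cumulative += 1.0 / (math.e * math.factorial(k))
        if cumulative >= target:
            return k
        k += 1

print(max_load_for(0.999))  # 5, matching the table above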
So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question are also helpful for checking that your hash function is behaving as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places each hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions, but means that adding to and searching the hash table is a little slower.
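For the curious, here is a rough Python sketch of that two-choice idea (the second hash here is an ad-hoc stand-in; in practice you'd use two genuinely independent hash functions):

import random
from collections import Counter

def two_choice_insert(bins, key, n):
    """Place a key in the less loaded of its two candidate slots."""
    slot1 = hash(key) % n
    slot2 = hash(key[::-1] + "#salt") % n   # ad-hoc second hash, just for the sketch
    chosen = slot1 if bins[slot1] <= bins[slot2] else slot2
    bins[chosen] += 1

n = 10_000
bins = Counter()
for i in range(n):
    two_choice_insert(bins, f"key-{i}-{random.random()}", n)

# With one hash function the max load grows like ln(n)/ln(ln(n));
# with two choices it drops to roughly ln(ln(n))/ln(2).
print("max load with two choices:", max(bins.values()))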
I hope this helps.