In explanations I've read about public key cryptography, it is said that a large number is produced by multiplying together two extremely large primes. Since factoring the product of large primes is almost impossibly time-consuming, you have security.
This seems like a problem that could be trivially solved with rainbow tables. If you know the approximate size of primes used and know there are 2 of them, you could quickly construct a rainbow table. It'd be a mighty large table, but it could be done and the task could be parallelized across hardware.
Why are rainbow tables not an effective way to beat public key crypto based on multiplying large primes?
Disclaimer: obviously tens of thousands of crazy-smart security conscious people didn't just happen to miss for decades what I thought up in an afternoon. I assume I'm misunderstanding this because I was reading simplified layman explanations (eg: if more than 2 numbers are used) but I don't know enough yet to know where my knowledge gap is.
Edit: I know "rainbow table" relates to using pre-calculated hashes in a lookup table but the above sounds like a rainbow table attack so I'm using the term here.
Edit 2: As noted in the answers, there's no way to store just all of the primes, much less all of their products.
This site says there are about this many 512-bit primes: (2^511) / (512 log(2)) = 4.35 × 10^151
The mass of the sun is 2 × 10^30 kg, or 2 × 10^33 g
That's 2.17 × 10^118 primes per gram of the sun.
Qty. of 512-bit numbers that can fit in a kilobyte: 1 kB = 1024 bytes = 8192 bits; 8192 / 512 = 16
That can fit in a terabyte: 16 × 1024^3 ≈ 1.72 × 10^10
Petabyte: 16 × 1024^4 ≈ 1.76 × 10^13
Exabyte: 16 × 1024^5 ≈ 1.80 × 10^16
Even if 1 exabyte weighed 1 gram, we're nowhere close to the 2.17 × 10^118 primes per gram needed to fit all of these numbers on a hard drive with the mass of the sun.
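Here's a quick back-of-the-envelope sketch of that arithmetic in Python (taking the ~4.35 × 10^151 prime-count estimate above as given; all the constants are approximate):

```python
# Back-of-the-envelope check of the storage argument above.
# The prime-count figure is the estimate quoted from the linked site.
num_512_bit_primes = 4.35e151   # approximate number of 512-bit primes (quoted estimate)
sun_mass_g = 2e33               # mass of the sun in grams

primes_per_gram = num_512_bit_primes / sun_mass_g
print(f"primes per gram of sun-mass storage: {primes_per_gram:.2e}")   # ~2.2e118

# How many 512-bit numbers fit in one exabyte?
bits_per_exabyte = 8 * 1024**6  # 1 EiB expressed in bits
numbers_per_exabyte = bits_per_exabyte / 512
print(f"512-bit numbers per exabyte: {numbers_per_exabyte:.2e}")        # ~1.8e16

# Even at one exabyte per gram, the shortfall is astronomical.
shortfall = primes_per_gram / numbers_per_exabyte
print(f"factor we are short by: {shortfall:.2e}")                       # ~1e102
```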
From one of my favorite books ever, Applied Cryptography by Bruce Schneier
"If someone created a database of all primes, won't he be
able to use that database to break public-key algorithms?
Yes, but he can't do it. If you could store one gigabyte
of information on a drive weighing one gram, then a list
of just the 512-bit primes would weigh so much that it
would exceed the Chandrasekhar limit and collapse into a
black hole... so you couldn't retrieve the data anyway"
In other words, it's impossible, infeasible, or both.
The primes used in RSA and Diffie-Hellman are typically on the order of 2^512. In comparison, there are only about 2^256 atoms in the known universe. That means 2^512 is large enough to assign 2^256 unique numbers to every atom in the universe.
There is simply no way to store/calculate that much data.
As an aside, I assume you mean "a large table of primes" - rainbow tables are specifically tailored to hashes, and have no real meaning here.
I think the main problem is that rainbow tables pregenerated for certain algorithms use a rather "small" range (usually something in the range of 128 bits). This doesn't usually cover the whole range, but speeds the brute force process up. They usually consume some TB of space.
In prime factorization, the numbers are much larger (for secure RSA, a 2048-bit modulus is recommended). So the rainbow tables wouldn't just be "mighty large"; they would be impossible to store anywhere (far more than millions of TB of space).
Also, rainbow tables rely on hash chains to trade computation for storage (Wikipedia has a good explanation), a trick that has no analogue for primes.
I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.
Which is less likely to result in a collision:
Taking the SHA-1 of the SHA-256 hash of the account number.
Taking the SHA-256 of the account number and picking out 40 characters.
These approaches are identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 is independent and "effectively random."
An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.
(*) There are some subtleties around the fact that you're using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different depending on how you're actually encoding. But it deeply doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.
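As a rough sketch of both suggestions (the truncated SHA-256 and the v5 UUID idea; the account number and namespace UUID below are made-up placeholders):

```python
import hashlib
import uuid

account_number = "1234567890"  # placeholder value

# Option 2 from the question: SHA-256 of the account number, keep 40 hex characters (160 bits).
public_id = hashlib.sha256(account_number.encode()).hexdigest()[:40]

# Alternative: a v5 UUID under a private namespace (placeholder UUID shown here; keep the real one secret).
PRIVATE_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")
public_uuid = uuid.uuid5(PRIVATE_NAMESPACE, account_number)

print(public_id)
print(public_uuid)
```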
Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e. if the number of possible identifiers is dramatically larger than the number of identifiers actually used), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.
The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make the cost of guessing arbitrarily expensive, so large spaces are impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:
Select a "salt" value. This can be publicly known. It does not matter. For this particular use case, because every account number is different, you can select a single global salt. (If the protected data could be the same, then it's important to have different salts for each record.)
Compute PBKDF2(salt, iterations, length, payload)
The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.
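A minimal sketch using Python's standard library (the salt value and iteration count are illustrative placeholders, not recommendations):

```python
import hashlib

# A single public, global salt is fine here because every account number is unique.
SALT = b"example-public-salt"   # placeholder value
ITERATIONS = 600_000            # tune until one call takes roughly 80-100 ms on your hardware
OUTPUT_BYTES = 20               # 20 bytes -> 40 hex characters

def stretched_id(account_number: str) -> str:
    digest = hashlib.pbkdf2_hmac(
        "sha256",                 # underlying hash
        account_number.encode(),  # payload
        SALT,
        ITERATIONS,
        dklen=OUTPUT_BYTES,
    )
    return digest.hex()           # 40-character identifier

print(stretched_id("1234567890"))
```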
A common target for iterations is a value that delivers around 80-100ms. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If it's smaller than a few years, then it probably can be brute forced quickly by throwing more hardware at it.
(Of course all of these choices can be tuned based on your attack model. It depends on how dedicated and well-funded you expect attackers to be.)
A 40-character ID is 320 bits, which gives you plenty of space. With only 250 records, you can easily fit a unique counter into that. Three digits is only 24 bits, and you have the range 000 to 999 to play with. Fill up the rest of the ID with, say, the hex expansion of part of the SHA-256 hash. With a 3-digit counter, that leaves 37 places for hex, which covers 37 × 4 = 148 bits of the SHA-256 output.
You may want to put the counter in the middle of the hex string in a fixed position instead of at the beginning or end to make it less obvious.
<11 hex chars><3 digit ID><26 hex chars>
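A rough sketch of that layout (the split positions follow the template above; the counter placement is an arbitrary choice):

```python
import hashlib

def make_id(account_number: str, counter: int) -> str:
    """Build a 40-character ID: <11 hex chars><3-digit counter><26 hex chars>."""
    digest = hashlib.sha256(account_number.encode()).hexdigest()
    return digest[:11] + f"{counter:03d}" + digest[11:37]

print(make_id("1234567890", 42))  # 40-character identifier with the counter embedded at position 11
```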
I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory-efficient way to catalog which line items I've already archived, and Redis bitmaps immediately come to mind.
Reading the documentation on Redis bitmaps indicates a maximum length of 2^32 bits (4,294,967,296).
A 19-digit integer could theoretically range anywhere from 0000000000000000001 to 9999999999999999999. Now I know that the data source in question does not actually have ten quintillion records, so the IDs are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the IDs are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2,328,306,436 different Redis bitmaps (9999999999999999999 / 4294967296 ≈ 2328306436.54)? I could probably work up a tiny algorithm that, given a 19-digit ID, divides by some constant to determine which split bitmap to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom filter, such as RedisBloom, would be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass BF.ADD an 'item' to be inserted. This item can be as long as you want; the filter uses hash functions and a modulus to fit it to the filter size. When you want to check whether an item has already been added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
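A minimal sketch with redis-py and the RedisBloom module (the key name, capacity, and error rate are illustrative; adjust them to your data):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Reserve a Bloom filter sized for the expected number of sub_ids.
# 0.001 is the desired false-positive rate, 100_000_000 the expected capacity (both illustrative).
# (BF.RESERVE errors out if the key already exists, so run it once.)
r.execute_command("BF.RESERVE", "archived_sub_ids", 0.001, 100_000_000)

sub_id = "4123120171499720404"

if not r.execute_command("BF.EXISTS", "archived_sub_ids", sub_id):
    # ... archive the record, then remember it:
    r.execute_command("BF.ADD", "archived_sub_ids", sub_id)
```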
How many "items" are there? What is "A LOT"?
Anyway, a linear approach that uses a single bit to track each of the 10^19 potential items requires at least 1,250 petabytes. This makes it impractical (at the moment) to store in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the IDs are not sequential and are very spread out, keeping track of which ones you have processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to pick the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.
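For reference, the plain-set approach would look something like this (the key name is an arbitrary choice):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

sub_id = "4123120171499720404"

# SADD returns 1 if the member was newly added, 0 if it was already present.
if r.sadd("archived_sub_ids", sub_id):
    pass  # first time we've seen this ID: go archive the record
```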
I've carefully read https://redis.io/topics/memory-optimization but I'm still confused. Basically, it says to cap the number of top-level keys by grouping them into hash maps (HSET). But what about the number of fields within each hash?
If I have 1,000,000 keys for a certain prefix, each with a unique value - suppose they're integer-looking, like "123456789". If I "shard" the keys by taking the first two characters (e.g. "12") as the hash name and the remainder (e.g. "3456789") as the field, then each hash will theoretically have 1,000,000 / 100 = 10,000 fields. Is that too many?
My (default) config is:
redis-store:6379> config get hash-*
1) "hash-max-ziplist-entries"
2) "512"
3) "hash-max-ziplist-value"
4) "64"
So, if I shard up each prefix's 1,000,000 keys, I'll have fewer than 512 - actually I'll have 100 (e.g. "12" or "99"). But what about within each one? There'll theoretically be 10,000 fields each. Does that mean I break the limit and can't benefit from the space optimization that hash maps offer?
You can use this formula to calculate the internal data overhead of a HASH for each key:
3 * next_power(n) * size_of(pointer)
Here n is the number of fields in your HASH. I assume you are using the x64 version of Redis, so size_of(pointer) is 8. So for a HASH with 10,000 fields you would have at least 3 * next_power(10,000) * 8 = 3 * 16,384 * 8 = 393,216 bytes of overhead.
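As a rough sketch of that estimate (assuming the formula above, where next_power rounds up to the next power of two):

```python
def next_power(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def hash_table_overhead(num_fields: int, pointer_size: int = 8) -> int:
    """Rough hash-table overhead in bytes for a Redis HASH, per the formula above."""
    return 3 * next_power(num_fields) * pointer_size

print(hash_table_overhead(10_000))  # 393216
```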
UPDATED
Please keep in mind that hash-max-ziplist-entries is not a silver bullet. Have a look at the article "Under the hood of Redis #2": the ziplist memory footprint can be estimated as roughly 21 * n, but while you save up to 10x the RAM, write speed can drop by up to 30x and read speed by up to 100x. So with 1,000,000 entries in a single HASH you could hit a critical performance breakdown.
You can read more about Redis HASH internals in "Under the hood of Redis #1".
After some extensive research I've finally understood how hash-max-ziplist-entries works.
https://www.peterbe.com/plog/understanding-redis-hash-max-ziplist-entries
Basically, you keep everything in one hash map, or you break it up into multiple hash maps if you need to store more fields than hash-max-ziplist-entries is set to.
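For what it's worth, here is a minimal sketch of the prefix-sharding scheme described in the question (the prefix length and key naming are arbitrary choices):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def shard_hset(key: str, value: str, prefix_len: int = 2) -> None:
    """Store value in a hash named by the key's prefix, with the remainder as the field."""
    bucket, field = key[:prefix_len], key[prefix_len:]
    r.hset(f"myprefix:{bucket}", field, value)

def shard_hget(key: str, prefix_len: int = 2):
    bucket, field = key[:prefix_len], key[prefix_len:]
    return r.hget(f"myprefix:{bucket}", field)

shard_hset("123456789", "some value")
print(shard_hget("123456789"))
```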
Does someone know about the internals of Redis LRU-based eviction / deletion?
How does Redis ensure that the older (lesser used) keys are deleted first (in case we do not have volatile keys and we are not setting TTL expiration)?
I know for sure that Redis has a configuration parameter "maxmemory-samples" that governs a sample size that it uses for removing keys - so if you set a sample size of 10 then it samples 10 keys and removes the oldest from amongst these.
What I don't know is whether it samples these keys completely randomly, or whether it somehow has a mechanism that allows it to sample from an equivalent of an "older / less used generation".
This is what I found at antirez.com/post/redis-as-LRU-cache.html: "the whole point of using a 'sample three' algorithm is to save memory. I think this is much more valuable than precision, especially since this randomized algorithms are rarely well understood. An example: sampling with just three objects will expire 666 objects out of a dataset of 999 with an error rate of only 14% compared to the perfect LRU algorithm. And in the 14% of the remaining there are hardly elements that are in the range of very used elements. So the memory gain will pay for the precision without doubts."
So although Redis samples randomly (implying that this is not actual LRU .. and as such an approximation algorithm), the accuracy is relatively high and increasing the sampling size will further increase this. However, in case someone needs exact LRU (there is zero tolerance for error), then Redis may not be the correct choice.
Architecture, as they say, is about tradeoffs, so use this (Redis LRU) approach to trade off accuracy for raw performance.
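If you want to experiment with that trade-off, the relevant settings can be changed at runtime; a quick sketch with redis-py (the values are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Evict any key using the approximated-LRU algorithm once maxmemory is reached.
r.config_set("maxmemory-policy", "allkeys-lru")

# Larger sample sizes approximate true LRU more closely, at some CPU cost.
r.config_set("maxmemory-samples", 10)
```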
Since v3.0.0 (2014) the LRU algorithm uses a pool of 15 keys, populated with the best candidates out of the different samplings of N keys (where N is defined by maxmemory-samples).
Every time a key needs to be evicted, N new keys are selected randomly and checked against the pool. If they're better candidates (older keys), they're added in it, while the worst candidates (most recent keys) are taken out, keeping the pool at a constant size of 15 keys.
At the end of the round, the best eviction candidate is selected from the pool.
Source: Code and comments in evict.c file from Redis source code
Suppose the RSA algorithm creates a private key on two machines. Is there any possibility that both keys could be the same?
Short answer: No. There is a theoretical possibility, but even if you create a key every second you aren't likely to get the same one twice before the sun explodes.
Yes. Have you heard of the pigeon-hole principle?
Normally, you create RSA keys by randomly selecting extremely large numbers and checking whether they're prime.
Given the sizes of the numbers involved (100+ digits), the only reasonable possibility of a collision is if there's a problem in the random number generator, so that (at least under some circumstances) the numbers it picks aren't very random.
This was exactly the sort of problem that led to a break in the SSL system in Netscape (~4.0, if memory serves). In this particular case, the problem was in generating a session key, but the basic idea was the same -- a fair amount of the "random" bits that were used were actually pretty predictable, so an attacker who knew the sources of the bits could fairly quickly generate the same "random" number, and therefore the same session key.
Yes, but the probability is very low.
In the RSA cryptosystem with public key (n,e), the private key (n,d) is generated such that n = p * q, where p, q are large N-bit primes and ed − 1 can be evenly divided by the totient (p − 1)(q − 1).
To generate the same private key, you would essentially need to generate the same p, q, and e, so the probability is abysmally small.
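As a rough illustration of just how small, here is a back-of-the-envelope birthday-bound sketch (the prime count is the ~4.35 × 10^151 estimate of 512-bit primes quoted earlier in this document; the generation rate and timescale are made up for illustration):

```python
# Rough birthday-bound estimate of an RSA key collision.
NUM_PRIMES = 4.35e151            # estimated count of 512-bit primes (figure quoted earlier)

KEYS_PER_SECOND = 1e9            # absurdly fast key generation (illustrative)
SECONDS_SO_FAR = 4.35e17         # roughly the age of the universe in seconds
keys_generated = KEYS_PER_SECOND * SECONDS_SO_FAR

# Each key is (roughly) an unordered pair of distinct primes.
possible_keys = NUM_PRIMES ** 2 / 2

# Birthday bound: probability of at least one collision ~ k^2 / (2 * N)
collision_probability = keys_generated ** 2 / (2 * possible_keys)
print(f"{collision_probability:.1e}")  # on the order of 1e-250
```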