Is it possible to get identical SHA1 hash? [duplicate] - cryptography

This question already has answers here:
Probability of SHA1 collisions
(3 answers)
Closed 6 years ago.
Given two different strings S1 and S2 (S1 != S2) is it possible that:
SHA1(S1) == SHA1(S2)
is True?
If yes - with what probability?
If not - why not?
Is there a upper bound on the length of a input string, for which the probability of getting duplicates is 0? OR is the calculation of SHA1 (hence probability of duplicates) independent of the length of the string?
The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).
Example:
Resource ID: X123
Parent ID: P123
I don't want to expose the nature of my resource identifies to allow client to see "X123-P123".
Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.

What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input that it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.
Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.
(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)
Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.
As for your third question: for a function with a n-bit output, then there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, rather with a steep cutoff around m=n/2. This is the same analysis than the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).
Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.

Yes it is possible because of the pigeon hole principle.
Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.
However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions. The best attack known takes 2^63 attempts to find a collision, so the chance is 2^(-63) which is 0 in practice.

git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.

A collision is almost always possible in a hashing function. SHA1, to date, has been pretty secure in generating unpredictable collisions. The danger is when collisions can be predicted, it's not necessary to know the original hash input to generate the same hash output.
For example, attacks against MD5 have been made against SSL server certificate signing last year, as exampled on the Security Now podcast episode 179. This allowed sophisticated attackers to generate a fake SSL server cert for a rogue web site and appear to be the reaol thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.

What you are talking about is called a collision. Here is an article about SHA1 collisions:
http://www.rsa.com/rsalabs/node.asp?id=2927
Edit: So another answerer beat me to mentioning the pigeon hole principle LOL, but to clarify this is why it's called the pigeon hole principle, because if you have some holes cut out for carrier pigeons to nest in, but you have more pigeons than holes, then some of the pigeons(an input value) must share a hole(the output value).

Related

is SHA-512 collision resistant?

According to the books that i have read, it says that S.H.A(Secure Hash Algorithm) is collision resistant.But if the input space is a 1024 bit number and the output space is a 512 bit message digest then shouldn't it be colliding for
(2^1024)/(2^512) times? As the range is lesser than the domain being mapped there should have been collisions. please explain where i am going wrong.
The chance for a collision does not depend on the input size. The chance to a 512-bit hash collision is 1.4×10^77, see Probability table
Maybe your book has also mentioned the definition of collision resistance? It does not mean that no collisions are created (which is clearly not the case), but that given a hash you are not able to create a message easily that produces this hash.
a hash function H is collision resistant if it is hard to find two
inputs that hash to the same output; that is, two inputs a and b such
that H(a) = H(b), and a ≠ b
From Wikipedia
As you describe: Since the input space (arbitrary size) is larger than the output space (e.g. 512bit for sha512), there always exist collisions.
"Collision resistant" means, it is adequately unlikely for a collision to be found.
Your confusion is answered when considering how large the output space "512 bits" really is:
2^512 (the number of possible configurations of a 512 bit array) is of the order 10^154.
For comparison: The number of atoms in the visible universe is somewhere in the range of 10^80.
A million is 10^6.
So a million of our 'visible universes' has 10^86 atoms.
A million times a million universes has 10^92 atoms.
If you could store a single 512 bit value on a single atom, how many universes would you need to have all possible 512 bit has values stored?
Starting with a specific 512bit number (and assuming the has function is not broken), the probability p to obtain a collision is assuming you can produce new hashes with a rate R and have the total time of t to do this is:
p = R*t/(2^(512/2))
(The exponent is halved, see "birthday attach". The expected search space for a success is to find a collision in n bits is n/2.)
Let's plugin in some example numbers:
The has rate of the bitcoin network is currently about R = 200*10^15 / s (200 million terrahashes per second).
Consider the situation that since the beginning of the universe the bitcoin network's current hashing capacity would have been available for the sole purpose of finding a collision for a specific hash value, i.e. for an available time of t=13.787*10^9 years,
then the probability that a collision would have been found by now is about 7 × 10^-41 %
Again, it is hard to appreciate how small this number is.
Edit: A similar question with a good answer is found here: https://crypto.stackexchange.com/questions/89558/are-sha-256-and-sha-512-collision-resistant

Is encrypting low variance values risky?

For example a credit card expiry month can be only of only twelve values. So a hacker would have a one in twelve chance of guessing the correct encrypted value of a month. If they knew this, would they be able to crack the encryption more quickly?
If this is the case, how many variations of a value are required to avoid this? How about a bank card number security code which is commonly only three digits?
If you use a proper cipher like AES in a proper way, then encrypting such values is completely safe.
This is because modes of operation that are considered secure (such as CBC and CTR) take an additional parameter called the initialization vector, which effectively randomizes the ciphertext even if the same plain text is encrypted multiple times.
Note that it's extremely important that the IV is used correctly. Every call of the encryption function must use a different IV. For CBC mode, the IV has to be unpredictable and preferably random, while CTR requires a unique IV (a random IV is usually not a bad choice for CTR either).
Good encryption means that if the user knows for example as you mentioned that the expiration month of a credit card is one of twelve values then it will limit the number of options by just that, and not more.
i.e.
If a hacker needs to guess three numbers, a, b, c, each of them can have values from 1 to 3.
The number of options will be 3*3*3 = 27.
Now the hacker finds out that the first number, a, is always the fixed value 2.
So the number of options is 1*3*3 = 9.
If revealing the value of the number a will result in limiting the number of options to a value less then 9 than you have been cracked, but in a strong model, if one of the numbers will be revealed then the number of options to be limited will be exactly to 9.
Now you are obviously not using only the exp. date for encryption, i guess.
I hope i was clear enough.

Hash Cryptographic Function Output Anomolies

Anyone know if MD5, Whirlpool, SHA[n], etc., have any "special" input that might get a hexdigest output to align into:
All numeric characters
All alpha characters
All of the same character/pattern repeated consistently or entirely
Example in python:
>>> from hashlib import sha1
>>> hash = sha1('magic_word').hexdigest()
>>> hash
4040404040404040404040404040404040404040
>>> hash = sha1('^3&#b d *#"').hexdigest()
aedefeebadcdccebefadcedddcbeadaedcbdeadc
Is this even possible? My knowledge of hashing functions is limited to the scope of applying them in databases for storing passwords, which is essentially none.
But sometimes I wonder, when testing for collisions, that these sorts of cases might arise...
A hash function models a random oracle: for each input, if it was not yet queried before, we throw some dice to find an output, then note it to some book. If an input is queried again, simply give back this old value.
By throwing a 16-sided dice 40 times (for each input), we get enough output for an SHA-1 like oracle. (For MD5, we only need 32 times.)
So, we can calculate the probability of "40 times only letters" as (6/16)^40 ≈ 9.15·10^-18, "40 times only digits" has probability (10/16)^40 ≈ 6.8·10^-9.
As "number of tries needed until the first success" is geometrically distributed, we need 1/p tries in average, i.e. around 10^17 tries for "only letters", and 1.5 ·10^8 tries for "only digits".
(Now, SHA-1 is not a real random oracle, but there is no weakness known which would say that SHA-1 would have better or worse probabilities for one of these. And for now, brute-force really seems to be the best way to do this.)
I'm sure with the right input, those sorts of outputs are possible. Why does it matter? Just curious?
Yes, it is possible. Given the right input, any desired bit pattern can be output. It might take a few million years to find the right input though.
For a reasonably wide target, like all hex 0-9 or all hex a-f it should be relatively easy. Calculating the proportion of acceptable outputs, in all possible outputs will help you get an estimate of the running time. Brute force or random searching will eventually find something that hits the target. For a broken hash, like MD4, you might be able to shave something off the expected time.

Why in some cases are used only the first x chars of a md5 hash instead of using all of them?

For example commit list on GitHub shows only first 10, or this line from tornadoweb which uses only 5
return static_url_prefix + path + "?v=" + hashes[abs_path][:5]
Are only the first 5 chars enough to make sure that 2 different hashes for 2 different files won't collide?
LE: The example above from tornadoweb uses md5 hash for generating a query sting for static file caching.
In general, No.
In fact, even if a full MD5 hash were given, it wouldn't be enough to prevent malicious users from generating collisions---MD5 is broken. Even with a better hash function, five characters is not enough.
But sometimes you can get away with it.
I'm not sure exactly what the context of the specific example you provided is. However, to answer your more general question, if there aren't bad guys actively trying to cause collisions, than using part of the hash is probably okay. In particular, given 5 hex characters (20 bits), you won't expect collisions before around 2^(20/2) = 2^10 ~ one thousand values are hashed. This is a consequence of the the Birthday paradox.
The previous paragraph assumes the hash function is essentially random. This is not an assumption anyone trying to make a cryptographically secure system should make. But as long as no one is intentionally trying to create collisions, it's a reasonable heuristic.

Cost of Preimage attack

I need to know the cost of succeeding with a Preimage attack ("In cryptography, a preimage attack on a cryptographic hash is an attempt to find a message that has a specific hash value.", Wikipedia).
The message I want to hash consists of six digits (the date of birth), then four random digits. This is a social security number.
Is there also a possibility to hash something using a specific password. This would introduce another layer of security as one would have to know the password in order to produce the same hash values for a message.
I am thinking about using SHA-2.
If you want to know how expensive it is to find a preimage for the string you're describing, you need to figure out how many possible strings there are. Since the first 6 digits are a date of birth, their value is even more restricted than the naive assumption of 10^6 - we have an upper bound of 366*100 (every day of the year, plus the two digit year).
The remaining 4 'random' digits permit another 10^4 possibilities, giving a total number of distinct hashes of 366 * 100 * 10^4 = 366,000,000 hashes.
With that few possibilities, it ought to be possible to find a preimage in a fraction of a second on a modern computer - or, for that matter, to build a lookup table for every possible hash.
Using a salt, as Tom suggests, will make a lookup table impractical, but with such a restricted range of valid values, a brute force attack is still eminently practical, so it alone is not sufficient to make the attack impractical.
One way to make things more expensive is to use iterative hashing - that is, hash the hash, and hash that, repeatedly. You have to do a lot less hashing than your attacker does, so increases in cost affect them more than they do you. This is still likely to be only a stopgap given the small search space, however.
As far as "using a password" goes, it sounds like you're looking for an HMAC - a construction that uses a hash, but can only be verified if you have the key. If you can keep the key secret - no easy task if you're assuming the hashes can only be obtained if your system is compromised in the first place - this is a practical system.
Edit: Okay, so 'fractions of a second' may have been a slight exaggeration, at least with my trivial Python test. It's still perfectly tractable to bruteforce on a single computer in a short timeframe, however.
SHA-2, salts, preimage atttacks, brute forcing a restricted, 6-digit number - man it would be awesome if we have a dial we could turn that would let us adjust the security. Something like this:
Time to compute a hash of an input:
SHA-2, salted Better security!
| |
\|/ \|/
|-----------------------------------------------------|
.01 seconds 3 seconds
If we could do this, your application, when verifying that the user entered data matches what you have hashed, would in fact be a few seconds slower.
But imagine being the attacker!
Awesome, he's hashing stuff using a salt, but there's only 366,000,000 possible hashes, I'm gonna blaze through this at 10,000 a second and finish in ~10 hours!
Wait, what's going on! I can only do 1 every 2.5 seconds?! This is going to take me 29 years!!
That would be awesome, wouldn't it?
Sure would.
I present unto you: scrypt and bcrypt. They give you that dial. Want to spend a whole minute hashing a password? They can do that. (Just make sure you remember the salt!)
I'm unsure what your question is exactly, but to make your encrypted value more secure, use salt values.
Edit: I think you are sort of describing salt values in your question.