What is the difference between a multi-collision and a first or second pre-image attack on a hash function? - cryptography

What is the difference between a multi-collision in a hash function and a first or second preimage attack?
First preimage attacks: given a hash h, find a message m such that
hash(m) = h.
Second preimage attacks: given a fixed message m1, find a different message m2 such that
hash(m2) = hash(m1).
Multi-collision attacks: generate a series of messages m1, m2, ... mN, such that
hash(m1) = hash(m2) = ... = hash(mN).
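To make the definitions concrete, here is a minimal Python sketch of a second-preimage search against a deliberately weakened hash (MD5 truncated to 16 bits, so the loop finishes quickly; against a full-width digest the same search would be computationally infeasible). The weak_hash name and the candidate format are my own illustration:

import hashlib
import itertools

def weak_hash(data):
    # Deliberately weak 16-bit hash: the first 2 bytes of MD5.
    return hashlib.md5(data).digest()[:2]

# Second-preimage attack: m1 is fixed in advance; find a different m2
# such that weak_hash(m2) == weak_hash(m1).
m1 = b"attack at dawn"
target = weak_hash(m1)

for counter in itertools.count():
    m2 = ("candidate-%d" % counter).encode()
    if m2 != m1 and weak_hash(m2) == target:
        print("second preimage found:", m2)
        break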
Wikipedia tells us that a preimage attack differs from a collision attack in that there is a fixed hash or message that is being attacked.
I am confused by papers which make statements like:
The techniques are not only efficient to search for collisions, but also applicable to explore the second-preimage of MD4. About the second-preimage attack, they showed that a random message was a weak message with probability 2^-122 and it only needed a one-time MD4 computation to find the second-preimage corresponding to the weak message.
The Second-Preimage Attack on MD4
If I understand correctly, the authors are saying that they have developed a multi-collision attack which covers a large enough set of messages that, given a random message, there is an extremely small but still significant chance it will overlap with one of their multi-collisions.
I have seen similar arguments in many papers. My question is: when does an attack stop being a multi-collision attack and become a second-preimage attack?
If a multi-collision collides with 2^300 other messages, does that count as a second preimage, since the multi-collision could be used to calculate the "preimage" of one of the messages it collides with? Where is the dividing line: 2^60, 2^100, 2^1000?
What if you can generate a preimage of all hash digests that begin with 23? Certainly it doesn't meet the strict definition of a preimage, but it is also very certainly a serious flaw in the cryptographic hash function.
If someone has a large multi-collision, then they could always produce a preimage for the hash of any message that collided with the multi-collision. For instance,
hash(m1) = hash(m2) = hash(m3) = h
Someone requests the preimage of h, and they respond with m2. When does this stop being silly and become a real attack?
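As a toy illustration of that scenario (a sketch using the same truncation trick as above, since real multi-collisions on a full-width hash are what the papers spend all their effort on): bucket messages by digest until one digest has three preimages, then "answer" a preimage query from the bucket.

import hashlib
from collections import defaultdict

def weak_hash(data):
    # 16-bit toy hash so multi-collisions are easy to find.
    return hashlib.md5(data).digest()[:2]

buckets = defaultdict(list)
counter = 0
while True:
    m = ("message-%d" % counter).encode()
    counter += 1
    h = weak_hash(m)
    buckets[h].append(m)
    if len(buckets[h]) == 3:  # hash(m1) = hash(m2) = hash(m3) = h
        digest, colliding = h, buckets[h]
        break

# Someone requests a preimage of `digest`; we respond with m2.
print("digest:", digest.hex(), "-> answer:", colliding[1])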
Rules of thumb? Know of any good resources on evaluating hash function attacks?
Related Links:
HASH COLLISION Q&A
Cryptographic Hashes
The eHash Main Page

It is about the attack scenario. The difference lies in the choice of input: in a multi-collision there is free choice of all inputs, while a second preimage is about finding any second input which has the same output as one specified input.
When a function doesn't have multi-collision resistance, it may be possible to find collisions for some kinds of messages, but not for all of them, so this does not imply a second-preimage weakness.

You did a lot of research before posting the question. I cannot answer much aside from the resources question, which is: I use the Handbook of Applied Cryptography by Menezes/van Oorschot/Vanstone for almost everything I ever wanted to know on topics of cryptography, including hashes.
Maybe you'll find a copy at your university's library. Good luck.

Related

Writing a computer program that will analyse the quality of another computer program?

I'm interested in knowing the possibilities of this. I'm working on a project that validates the skills of a software engineer; currently we validate skills based on code reviews by credentialed developers.
I know the answer is far more complicated than the question, and I can't imagine how complex the program would have to be to analyse complex code, but I am starting with basic programming interview questions.
For example, the classic FizzBuzz question:
Write a program that prints the numbers from 1 to 20. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
and below is the solution in python:
for num in range(1, 21):
    string = ""
    if num % 3 == 0:
        string = string + "Fizz"
    if num % 5 == 0:
        string = string + "Buzz"
    if num % 5 != 0 and num % 3 != 0:
        string = string + str(num)
    print(string)
Question is, can we programmatically analyse the validity of this solution?
I would like to know if anyone has attempted this, and if there are current implementations I can take a look at. Also if anyone has used z3, and if it is something I can use to solve this problem.
As Vilx- mentioned, correctness of programs (including whether or not they terminate) is in general known to be undecidable. However, tools such as Z3 show that relevant concrete cases can still be reasoned about, despite the general undecidability of the problem.
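For a flavour of what Z3 can do here (a minimal sketch using the z3-solver Python package; the property checked is my own example, not a full program-equivalence proof), you can ask Z3 whether divisibility by 15 really coincides with divisibility by both 3 and 5, the arithmetic fact FizzBuzz hinges on:

from z3 import Int, Solver, And, Not, unsat

n = Int("n")
# Claim: n % 15 == 0 exactly when n % 3 == 0 and n % 5 == 0.
claim = (n % 15 == 0) == And(n % 3 == 0, n % 5 == 0)

s = Solver()
s.add(Not(claim))  # search for a counterexample to the claim
if s.check() == unsat:
    print("proved: no counterexample exists")
else:
    print("counterexample:", s.model())

Proving an arbitrary submission equivalent to a reference implementation is far harder, but this kind of query is the building block such tools are made of.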
Static analysers typically look for "simple" problems (e.g. null dereferences, out-of-bounds accesses, numerical overflows), but are comparably fast and require little user guidance (think of guidance in the spirit of adding type annotations to your code).
A non-exhaustive (and biased) list of keywords to search for: "static analysers", "abstract interpretation"; "facebook infer", "airbus absint", "juliasoft".
Verifiers attempt to prove much richer properties, in particular functional correctness, e.g. "does this sort-implementation really sort my array (and not do anything else, e.g. deallocate some global memory or update an element reachable from the array)?" or "does that crypto-implementation really implement the crypto protocol it promises to implement?". This is a much harder task and tools from that line of research are typically rather slow, require expert users with a background in formal verification and significant user guidance.
A non-exhaustive (and biased) list of keywords to search for: "verification", "hoare logic", "separation logic"; "eth viper", "microsoft dafny", "kuleuven verifast", "microsoft f*".
Other formal methods exist, e.g. refinement (or correct-by-construction), but with even less tool support and, as far as I know, industry acceptance.
Let's put it this way: it's been mathematically proven that you CANNOT, in general, determine whether an arbitrary program will ever terminate (the halting problem). So if you want a mathematically perfect answer as to whether the target program is correct, you're doomed.
That said, you can still do unit tests and "linting", which will give you plenty of interesting insights.
But for simple pieces of code like the FizzBuzz, I think that eyeballing by an experienced dev will probably bring the best results.
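If you do want some automation before the eyeballing, differential testing against a reference implementation catches most spec violations for questions like FizzBuzz. A minimal sketch (candidate_source stands in for a hypothetical submission):

import subprocess
import sys

def reference_fizzbuzz(n=20):
    # Trusted reference implementation of the spec.
    lines = []
    for num in range(1, n + 1):
        s = ("Fizz" if num % 3 == 0 else "") + ("Buzz" if num % 5 == 0 else "")
        lines.append(s or str(num))
    return "\n".join(lines)

candidate_source = """
for num in range(1, 21):
    string = ""
    if num % 3 == 0:
        string = string + "Fizz"
    if num % 5 == 0:
        string = string + "Buzz"
    if num % 5 != 0 and num % 3 != 0:
        string = string + str(num)
    print(string)
"""

# Run the candidate in a subprocess (with a timeout, since it may not halt)
# and compare its output against the reference.
result = subprocess.run([sys.executable, "-c", candidate_source],
                        capture_output=True, text=True, timeout=5)
print("PASS" if result.stdout.strip() == reference_fizzbuzz() else "FAIL")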

Analyzing time complexity (Poly log vs polynomial)

Say an algorithm runs at
5n^3 + 8n^2 (lg n)^4
Which is the first order term? Would it be the one with the poly log or the polynomial?
For every two constants a > 0, b > 0, log(n)^a is in o(n^b) (note the little-o notation here).
One way to prove this claim is to examine what happens when we apply a monotonically increasing function to both sides: the log function.
log(log(n)^a) = a * log(log(n))
log(n^b) = b * log(n)
Since we can ignore constants when it comes to asymptotic notation, the question "which is bigger, log(n)^a or n^b?" reduces to "which is bigger, log(log(n)) or log(n)?", which is much more intuitive to answer: log(log(n)) grows strictly slower. Applied to the question with a = 4 and b = 1: (lg n)^4 is in o(n), so 8n^2 (lg n)^4 is in o(n^3), and 5n^3 is the first-order term.
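A quick numeric sanity check (a sketch; note that the polylog term actually dominates for small n, and the asymptotics only take over for larger n):

import math

# Ratio of the polylog term to the cubic term; it tends to 0 as n grows.
for e in [10, 20, 40, 80]:
    n = 2 ** e
    ratio = (8 * n**2 * math.log2(n)**4) / (5 * n**3)
    print("n = 2^%d: ratio = %.6g" % (e, ratio))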

What makes the trapdoor function in elliptic curve cryptography hard to reverse?

I've been reading this article on elliptic-curve crypto and how it works:
http://arstechnica.com/security/2013/10/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/
In the article, they state:
It turns out that if you have two points [on an elliptic curve], an initial point "dotted" with itself n times to arrive at a final point [on the curve], finding out n when you only know the final point and the first point is hard.
It goes on to state that the only way to find out n (if you only have the first and final points, and you know the curve equation) is to repeatedly dot the initial point until you finally have the matching final point.
I think I understand all this - but what confuses me is: if n is the private key, and the final point corresponds to the public key (which I think is the case), then doesn't it take the exact same amount of work to compute the public key from the private as it does the private from the public (both just have to repeatedly dot a point on the curve)? Am I misunderstanding something about what the article is saying?
The one-way attribute of ECC and RSA comes from modular arithmetic: a series of arithmetic divisions where only the remainder is kept (the modulo operation, %), which discards information in the output. As a result, the person with the keys takes one direct path to generate the output, while any would-be attacker has to exhaust a massive number of possible paths in order to determine what key was used to create the output. If simple division were used instead of a modulo, key data would be present in the output and it couldn't be used for cryptography.
If you lived in a world where you had a powerful enough computer to exhaust all possibilities, then this wouldn't be useful as a cryptographic primitive. The computers we have now are fairly powerful, so we balance the power of our modern machines against a key size that gives a large enough range of possibilities that it cannot be exhausted in a timeframe that matters.
These problems relate to the P vs NP question, so perhaps proving P = NP may lead to a way of undermining the one-way aspect of asymmetric cryptography. We do know there is a way to factor with a quantum computer running Shor's algorithm; Shor's algorithm shows that the so-called "trapdoor", or one-way attribute, can be defeated, but it is still an expensive attack to conduct.
The following lecture is my favorite description of this one-way behavior. It shows that there are many possible solutions in one direction, forcing an attacker to exhaust them all, while there is only one solution in the other:
https://www.youtube.com/watch?v=ru7mWZJlRQg
EDIT: I previously stated that n is not the private key. In your example, n is either the server's or the client's private key.
How it works is that there is a starting point known to everybody.
You select a random integer k and do the "dotting operation" k times, then send the resulting point to the server. (k is your private key.)
The server does the same with the starting point, but q times, and sends its result to you. (q is the server's private key.)
You take the point you got from the server and "dot" it k times. The final point is the starting point "dotted" k*q times.
The server does the same with the point it got from you. Again, its final point is the starting point "dotted" q*k times.
That means the final point (the starting point "dotted" k*q times) is the shared secret, since all an attacker would know is the starting point, the starting point dotted k times, and the starting point dotted q times. Given only those data, it is practically impossible to find the final point without knowing k or q.
EDIT: No, it doesn't take the same amount of time to compute k from G = kP given known values of G (the sent point) and P (the starting point). More in the comment section, and:
For raising to a power, see exponentiation by squaring.
For ECC point multiplication, see point multiplication.
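To make the asymmetry concrete, here is a toy sketch (curve parameters picked small purely for illustration; nothing here is a secure implementation): double-and-add computes k*P in O(log k) point operations, while recovering k from k*P takes a brute-force walk through the group.

# Toy elliptic-curve arithmetic over GF(p): y^2 = x^3 + ax + b.
p, a, b = 97, 2, 3
O = None  # point at infinity (group identity)

def point_add(P, Q):
    if P is O: return Q
    if Q is O: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if P == Q:  # doubling; pow(x, -1, p) is the modular inverse
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, P):
    # Double-and-add: computes k*P in O(log k) point operations.
    R = O
    while k:
        if k & 1:
            R = point_add(R, P)
        P = point_add(P, P)
        k >>= 1
    return R

G = (3, 6)       # on the curve: 6^2 = 36 = 3^3 + 2*3 + 3 (mod 97)
k, q = 13, 29    # client and server private keys
client_pub = scalar_mult(k, G)
server_pub = scalar_mult(q, G)
# Both sides derive the same shared point: k*(q*G) == q*(k*G) == (k*q)*G.
assert scalar_mult(k, server_pub) == scalar_mult(q, client_pub)

# The attacker, knowing only G and client_pub, must brute-force k:
recovered = next(i for i in range(1, p + 2) if scalar_mult(i, G) == client_pub)
print("recovered private key:", recovered)

With a real curve the brute-force loop would run on the order of 2^256 iterations, while the legitimate parties still only pay a few hundred point operations.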

RSACryptoPad gives different results each time

I'm trying to test my RSA implementation's correctness against the RSACryptoPAD example here: http://www.codeproject.com/Articles/10877/Public-Key-RSA-Encryption-in-C-NET
But it always produces different encryption results. Isn't RSA just a mod-and-power operation? Yet the program can decrypt all the different encrypted texts correctly. My results match those of the http://nmichaels.org/rsa.py site. I think RSACryptoPAD is doing something extra?
The code uses RSACryptoServiceProvider. The line
byte[] encryptedBytes = rsaCryptoServiceProvider.Encrypt( tempBytes, true );
tells it to encrypt using OAEP padding, which introduces randomness into the padding. The reason for doing this is that encrypting the same plaintext will then yield different results each time (as you are seeing). This is a good thing, as it stops an information leak where an attacker sees that you sent the same message multiple times.
There's a great historical example of why this is important: "AF is short of water" (the Battle of Midway).
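You can reproduce both behaviours in a few lines. A sketch using the pyca/cryptography Python package (my choice of library, not what RSACryptoPAD uses - that is C#/.NET - but the padding behaviour is the same idea): raw textbook RSA, pow(m, e, n), is fully deterministic, whereas two OAEP encryptions of the same plaintext differ yet both decrypt correctly.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pub = key.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

c1 = pub.encrypt(b"attack at dawn", oaep)
c2 = pub.encrypt(b"attack at dawn", oaep)
print("ciphertexts differ:", c1 != c2)  # True: OAEP adds fresh randomness
print("decrypt identically:",
      key.decrypt(c1, oaep) == key.decrypt(c2, oaep))  # True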

Collision Attacks, Message Digests and a Possible solution

I've been doing some preliminary research in the area of message digests, specifically collision attacks on cryptographic hash functions such as MD5 and SHA-1, for example the PostScript attack and the X.509 certificate duplicate.
From what I can tell, in the case of the PostScript attack, specific data was generated and embedded within the header of the PostScript file (which is ignored during rendering) that brought the internal state of the MD5 computation to a state such that the modified wording of the document would lead to a final MD value equivalent to that of the original PostScript file.
The X.509 attack took a similar approach, whereby data was injected within the comment/whitespace sections of the certificate.
Ok so here is my question, and I can't seem to find anyone asking this question:
Why isn't the length of ONLY the data being consumed added as a final block to the MD calculation?
In the case of X.509 - Why is the whitespace and comments being taken into account as part of the MD?
Wouldn't a simple process such as one of the following be enough to defeat these collision attacks:
MD(M + |M|) = xyz
MD(M + |M| + |M| * magicseed_0 + ... + |M| * magicseed_n) = xyz
where:
M: the message
|M|: the size of the message
MD: the message digest function (e.g. MD5, SHA, Whirlpool, etc.)
xyz: the pairing of the actual message digest value for the message M and |M|, i.e. <M, |M|>
magicseed_i: a set of random values generated with a seed based on the internal state prior to the size being added.
This technique should work since, to date, all such collision attacks have relied on adding more data to the original message.
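For concreteness, a minimal sketch of the first variant with MD5 as the underlying digest (the fixed-width length encoding is my own choice):

import hashlib
import struct

def md_with_length(message):
    # MD(M + |M|): append the message length as a fixed-width field.
    # (MD5 already appends the bit length in its padding -- so-called
    # Merkle-Damgard strengthening -- which is why the known collision
    # attacks produce colliding messages of equal length.)
    suffix = struct.pack("<Q", len(message))
    return hashlib.md5(message + suffix).hexdigest()

print(md_with_length(b"hello world"))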
In short, generating a colliding message which:
not only produces the same MD,
but is also comprehensible/parsable/compliant,
and is also the same size as the original message,
is immensely difficult, if not nearly impossible. Has this approach ever been discussed? Any links to papers etc. would be nice.
Further question: what is the lower bound for collisions of messages of common length for a hash function H chosen randomly from U, where U is the set of universal hash functions?
Is it 1/N (where N is 2^|M|), or is it greater? If it is greater, that implies there is more than one message of length |M| that will map to the same MD value for a given H.
If that is the case, how practical is it to find these other messages? Brute force would be O(2^|M|); is there a method with time complexity less than brute force?
Can't speak to the rest of the questions, but the first one is fairly simple: adding length data to the input of the MD5, at any stage of the hashing process (1st block, Nth block, final block), just changes the output hash. You couldn't retrieve that length from the output hash string afterwards. It's also not inconceivable that a collision could be produced from another string of the exact same length in the first place, so saying "the original string was 17 bytes" is meaningless, because the colliding string could also be 17 bytes.
e.g.
md5("abce(17bytes)fghi") = md5("abdefghi<long sequence of text to produce collision>")
is still possible.
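To see how plentiful equal-length collisions are, here is a birthday-style search (a sketch against a 32-bit truncation of MD5, so it completes quickly; every candidate has the same 16-byte length):

import hashlib

def weak_hash(data):
    # 32-bit toy hash: the first 4 bytes of MD5.
    return hashlib.md5(data).digest()[:4]

seen = {}
counter = 0
while True:
    m = ("%016d" % counter).encode()  # all candidates are 16 bytes long
    counter += 1
    h = weak_hash(m)
    if h in seen:
        print("equal-length collision:", seen[h], m)
        break
    seen[h] = m

By the birthday bound this finds a collision after roughly 2^16 candidates; against the full 128-bit MD5 the same approach would need on the order of 2^64.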
In the case of X.509 certificates specifically, the "comments" are not comments in the programming language sense: they are simply additional attributes with an OID that indicates they are to be interpreted as comments. The signature on a certificate is defined to be over the DER representation of the entire tbsCertificate ('to be signed' certificate) structure which includes all the additional attributes.
Hash function design is pretty deep theory, though, and might be better served on the Theoretical CS Stack Exchange.
As @Marc points out, though, as long as more bits can be modified than the hash function's output contains, the pigeonhole principle guarantees that a collision must exist for some pair of inputs. Because cryptographic hash functions are in general designed to behave pseudo-randomly over their inputs, collisions will tend to be uniformly distributed over possible inputs.
EDIT: Incorporating the message length into the final block of the hash function would be equivalent to appending the length of everything that has gone before to the input message, so there's no real need to modify the hash function to do this itself; rather, specify it as part of the usage in a given context. I can see where this would make some types of collision attacks harder to pull off, since if you change the message length there's a changed field "downstream" of the area modified by the attack. However, this wouldn't necessarily impede the X.509 intermediate CA forgery attack since the length of the tbsCertificate is not modified.