Vigenère cipher - cryptanalytic attack - cryptography

I implemented a program to break a Vigenere cipher. I've used the algorithm presented on this site.
Step 1 - Finding the key length: I assume the key length is 2, 3, ..., up to some maximum key length, and for each candidate length I extract the corresponding substrings (2 substrings for key length 2, 3 substrings for key length 3, and so on). If the candidate length is right, each substring was enciphered with a single Caesar shift. I then compute the index of coincidence (IC) for each substring of a given period and take the average IC for that period. The period with the highest average IC is taken as the Vigenère key length.
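The key-length search in Step 1 can be sketched like this (a minimal sketch of the described approach; the function and variable names are my own):

```python
from collections import Counter

def index_of_coincidence(text):
    """IC = sum of n_i*(n_i - 1) over the letters, divided by N*(N - 1)."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def avg_ic_for_period(ciphertext, period):
    """Average IC over the `period` interleaved substrings."""
    subs = [ciphertext[i::period] for i in range(period)]
    return sum(index_of_coincidence(s) for s in subs) / period
```

For English plaintext the per-substring IC should rise toward roughly 0.066 at the true period (and at its multiples), while wrong periods stay closer to 0.038; for that reason it is usually safer to prefer the smallest period that stands out rather than the global maximum.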
Step 2 - Finding the key: Knowing the key length (say n), I extract the n substrings (that is, n Caesar ciphers) from the Vigenère ciphertext and run a Caesar brute-force attack on each one. For each candidate decryption produced by this attack I compute the chi-squared statistic against English letter frequencies to pick the key of that Caesar cipher. The n Caesar keys together form the Vigenère key.
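Step 2's chi-squared recovery might look like the following (a sketch assuming A-Z-only ciphertext; the frequency table holds approximate reference values for English, not figures from the question):

```python
# Approximate relative frequencies of A..Z in typical English text.
ENGLISH_FREQ = [
    0.0817, 0.0150, 0.0278, 0.0425, 0.1270, 0.0223, 0.0202, 0.0609,
    0.0697, 0.0015, 0.0077, 0.0403, 0.0241, 0.0675, 0.0751, 0.0193,
    0.0010, 0.0599, 0.0633, 0.0906, 0.0276, 0.0098, 0.0236, 0.0015,
    0.0197, 0.0007,
]

def chi_squared(text):
    """Chi-squared distance between `text`'s letter counts and English."""
    n = len(text)
    counts = [0] * 26
    for c in text:
        counts[ord(c) - 65] += 1
    return sum((counts[i] - n * ENGLISH_FREQ[i]) ** 2 / (n * ENGLISH_FREQ[i])
               for i in range(26))

def best_caesar_shift(substring):
    """Try all 26 shifts; the decryption closest to English wins."""
    return min(range(26), key=lambda s: chi_squared(
        ''.join(chr((ord(c) - 65 - s) % 26 + 65) for c in substring)))

def recover_vigenere_key(ciphertext, key_len):
    return ''.join(chr(best_caesar_shift(ciphertext[i::key_len]) + 65)
                   for i in range(key_len))
```

Note that chi-squared becomes unreliable when each substring is short (a long key over a short ciphertext), which is one common reason for recovering only part of the key.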
The problem is that, on average, only about 2/3 of the letters of the Vigenère key are recovered correctly. Am I doing something wrong? Is there any chance of getting the complete key with this approach?
Any advice would be greatly appreciated.


Are AES keys just random bytes of a specific length or is there some sort of extra checks?

I want to scale up a simple website, but I just need simple encryption with the key supplied through an environment variable rather than setting up a Redis instance to hold it.
I'm looking at this Converting Secret Key into a String and Vice Versa to do the retrieval.
I know I can export the string, but I was wondering whether any arbitrary bytes can be used, as long as they meet the length requirement.
An AES key is a sequence of 16, 24, or 32 bytes chosen by a cryptographically secure random number generator. There are no checks other than the length.
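Concretely, using only Python's standard library (a sketch; variable names are mine):

```python
import base64
import secrets

# Any 16, 24, or 32 uniformly random bytes form a valid AES-128/192/256 key.
key = secrets.token_bytes(32)
assert len(key) == 32

# To pass the key through an environment variable, encode it as text
# and decode it again at startup.
key_b64 = base64.b64encode(key).decode('ascii')
assert base64.b64decode(key_b64) == key
```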

Merkle–Hellman knapsack cryptosystem - My exam

I'm learning about Merkle-Hellman cryptosystem.
Here is my question: why is q chosen to be greater than the sum of all the numbers in the sequence? See:
https://en.wikipedia.org/wiki/Merkle–Hellman_knapsack_cryptosystem
Thanks all.
The answer is in the next few sentences of that same Wikipedia article:
q is chosen this way to ensure the uniqueness of the ciphertext. If it is any smaller, more than one plaintext may encrypt to the same ciphertext. Since q is larger than the sum of every subset of w, no sums are congruent mod q and therefore none of the private key's sums will be equal.
So, in short, q is chosen to ensure uniqueness of the ciphertext, which is important. If message a encrypts to b and message c also encrypts to b, then there is no unique decryption for b: it could be either a or c. Encryption must be one-to-one from plaintext to ciphertext; otherwise decryption involves an element of guessing.
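This can be checked directly with a small superincreasing sequence (the values below match the example commonly given in the Wikipedia article; treat them as illustrative):

```python
from itertools import combinations

# A superincreasing private key: each element exceeds the sum of all before it.
w = [2, 7, 11, 21, 42, 89, 180, 354]
assert all(w[i] > sum(w[:i]) for i in range(len(w)))

q = 881                                    # q > sum(w) = 706
assert q > sum(w)

# Each subset of w encodes one 8-bit plaintext. Because q exceeds every
# possible subset sum, reducing mod q cannot merge two of them.
subset_sums = [sum(s) % q
               for r in range(len(w) + 1)
               for s in combinations(w, r)]
assert len(subset_sums) == 2 ** 8
assert len(set(subset_sums)) == 2 ** 8     # all residues distinct

# With a modulus smaller than the total, distinct plaintexts can collide:
q_bad = 347
assert 354 % q_bad == 7 % q_bad            # subsets {354} and {7} coincide
```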

Determining the key of a Vigenere Cipher if key length is known

I'm struggling to get my head around the Vigenere Cipher when you know the length of the key but not what it is. I can decipher text if I know the key but I'm confused as to how to work out what the key actually is.
For one example I'm given ciphertext and a key length of 6. That's all I'm given; I'm told the key is an arbitrary sequence of letters that doesn't necessarily spell a word in the English language, in other words a random set of letters.
So far I've only broken the ciphertext up into 6 subtexts, each containing the letters encrypted by one key letter: the first subtext contains every 6th letter starting with the first, the second every 6th letter starting with the second, and so on.
What do I do now?
You calculate a letter frequency table for each letter of the key. If, as in your example, the key length is 6, you get 6 frequency tables. You should see similar frequency profiles across the tables, although not for the same letters. If you do not, then you have the wrong key length.
Now compare against a letter frequency table for English (for example, see http://en.wikipedia.org/wiki/Letter_frequency). If the pattern does not match at all, the clear text was not in English. If it does, assign the most frequent letters in each subtext to the most frequent letters in the reference table, and so on, and see what you get. Note that your text may have slightly different frequencies; the reference tables are statistics based on a large amount of data. From there you need to use your head.
Using common digrams (such as th and sh in English) can help.
One approach is frequency analysis. Take each of the six groups and build a frequency table for each character. Then compare that table to a table of known frequencies for the plaintext (if it's standard text, this would just be the English language).
A second, possibly simpler, approach is to just brute-force each character. The number of possible keys is 26^6 ≈ 309,000,000, which is about 28 bits of key space. That is brute-forceable, but it would take a while on a personal computer. Brute-forcing one character at a time, however, takes only 26*6 = 156 tries. To do so, write a function that scores an attempted decryption by how "plaintext-like" it looks. You might do frequency analysis as above, but there are simpler tests. Then brute-force each of the six sets of characters and, for each one, pick the key letter whose decryption scores best.
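A minimal version of the character-at-a-time attack, with a very simple scorer (the frequency values are assumed approximations, and all names are my own):

```python
# Assumed approximate English letter frequencies, used as weights.
FREQ = {'E': .127, 'T': .091, 'A': .082, 'O': .075, 'I': .070, 'N': .067,
        'S': .063, 'H': .061, 'R': .060, 'D': .043, 'L': .040, 'C': .028,
        'U': .028, 'M': .024, 'W': .024, 'F': .022, 'G': .020, 'Y': .020,
        'P': .019, 'B': .015, 'V': .010, 'K': .008, 'J': .002, 'X': .002,
        'Q': .001, 'Z': .001}

def score(text):
    """Higher = more English-like: total frequency mass of the letters."""
    return sum(FREQ.get(c, 0) for c in text)

def crack(ciphertext, key_len):
    key = ''
    for i in range(key_len):
        group = ciphertext[i::key_len]
        # 26 tries per key position instead of 26**key_len overall.
        shift = max(range(26), key=lambda s: score(
            ''.join(chr((ord(c) - 65 - s) % 26 + 65) for c in group)))
        key += chr(shift + 65)
    return key
```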

Is encrypting low variance values risky?

For example, a credit card expiry month can take only twelve values, so a hacker would have a one-in-twelve chance of guessing the plaintext behind an encrypted month. If they knew this, would they be able to crack the encryption more quickly?
If so, how many possible values must a field have to avoid this? What about a bank card number security code, which is commonly only three digits?
If you use a proper cipher like AES in a proper way, then encrypting such values is completely safe.
This is because modes of operation that are considered secure (such as CBC and CTR) take an additional parameter called the initialization vector, which effectively randomizes the ciphertext even if the same plain text is encrypted multiple times.
Note that it's extremely important that the IV is used correctly. Every call of the encryption function must use a different IV. For CBC mode, the IV has to be unpredictable and preferably random, while CTR requires a unique IV (a random IV is usually not a bad choice for CTR either).
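To see why a per-message IV makes equal plaintexts encrypt differently, here is a toy CTR-style construction (illustration only, NOT real AES; it is built from SHA-256 so it runs with the standard library alone, and all names are mine):

```python
import hashlib
import os

def toy_ctr_encrypt(key, iv, data):
    """Toy CTR-style cipher, for illustration only (NOT real AES):
    keystream block i = SHA-256(key || iv || i), XORed into the data.
    The same call both encrypts and decrypts."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + iv + (i // 32).to_bytes(8, 'big')).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

key = os.urandom(32)
month = b'expiry month: 07'                # a low-variance plaintext

iv1, iv2 = os.urandom(16), os.urandom(16)  # fresh IV per encryption
c1 = toy_ctr_encrypt(key, iv1, month)
c2 = toy_ctr_encrypt(key, iv2, month)

assert c1 != c2                            # same plaintext, different ciphertexts
assert toy_ctr_encrypt(key, iv1, c1) == month   # XOR again to decrypt
```

Without the fresh IV, every "07" would encrypt to the same bytes, and an observer could count how often each month occurs even without decrypting anything.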
Good encryption means that if an attacker knows, for example, that the expiration month of a credit card is one of twelve values, then that knowledge narrows the options to exactly those twelve, and no further.
i.e.
If a hacker needs to guess three numbers, a, b, c, each of them can have values from 1 to 3.
The number of options will be 3*3*3 = 27.
Now the hacker finds out that the first number, a, is always the fixed value 2.
So the number of options is 1*3*3 = 9.
If revealing the value of a limits the number of options to fewer than 9, the scheme has been cracked; in a strong model, revealing one of the numbers limits the options to exactly 9.
And obviously you are not encrypting only the expiry date, I assume.
I hope that was clear enough.
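The counting in this example can be checked directly:

```python
from itertools import product

# All triples (a, b, c) with each value drawn from 1..3:
all_keys = list(product(range(1, 4), repeat=3))
assert len(all_keys) == 27          # 3 * 3 * 3

# If the attacker learns that a is always 2, only those triples survive:
remaining = [k for k in all_keys if k[0] == 2]
assert len(remaining) == 9          # 1 * 3 * 3
```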

Is it possible to get identical SHA1 hash? [duplicate]

Given two different strings S1 and S2 (S1 != S2) is it possible that:
SHA1(S1) == SHA1(S2)
is True?
If yes - with what probability?
If not - why not?
Is there an upper bound on the length of an input string for which the probability of duplicates is 0? Or is the computation of SHA-1 (and hence the probability of duplicates) independent of the length of the string?
The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).
Example:
Resource ID: X123
Parent ID: P123
I don't want to expose the nature of my resource identifiers by letting the client see "X123-P123".
Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.
What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input than it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.
Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.
(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)
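The birthday bound can be checked numerically with the standard approximation P ≈ 1 - exp(-k(k-1)/(2N)) for k random values drawn from N possible outputs:

```python
import math

def birthday_collision_prob(k, N):
    """Approximate probability of at least one collision among k
    uniformly random values drawn from N possibilities."""
    return 1.0 - math.exp(-k * (k - 1) / (2 * N))

# Classic birthday paradox: 23 people, 365 birthdays -> about 50%.
assert abs(birthday_collision_prob(23, 365) - 0.5) < 0.01

# SHA-1: after 2**80 random strings a collision is already likely (~39%).
p = birthday_collision_prob(2 ** 80, 2 ** 160)
assert 0.35 < p < 0.45
```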
Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.
As for your third question: for a function with an n-bit output, collisions necessarily exist if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n bits. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, but rather with a steep cutoff around m = n/2. This is the same analysis as the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).
Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.
Yes, it is possible, because of the pigeonhole principle.
Most hashes (SHA-1 included) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.
However, cryptographic hash functions (like the SHA family, the MD family, etc.) are designed to minimize such collisions. The best attack known takes about 2^63 attempts to find a collision, so in practice collisions do not turn up by accident.
git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.
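For the asker's stated goal, a keyed hash is arguably a safer choice than a bare hash, since internal IDs like "X123-P123" are short enough that a client could brute-force them offline by hashing guesses. A sketch (the secret value and all names here are hypothetical):

```python
import hashlib
import hmac

SECRET = b'example-server-side-secret'   # hypothetical; never sent to clients

def opaque_id(resource_id, parent_id):
    """Stable, opaque external ID for internal IDs like 'X123-P123'.
    Keying the hash with a server-side secret stops clients from
    hashing guessed internal IDs and comparing against exposed values."""
    data = f'{resource_id}-{parent_id}'.encode()
    return hmac.new(SECRET, data, hashlib.sha256).hexdigest()[:16]

# The same internal IDs always map to the same external ID:
assert opaque_id('X123', 'P123') == opaque_id('X123', 'P123')
```

Truncating to 16 hex characters (64 bits) keeps the IDs short while leaving accidental collisions vanishingly unlikely at realistic row counts.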
A collision is almost always possible in a hashing function. SHA-1 has, to date, held up fairly well: its collisions are not predictable. The danger arises when collisions can be predicted; then it is not necessary to know the original hash input to generate the same hash output.
For example, attacks against MD5 were mounted against SSL server certificate signing last year, as described on the Security Now podcast, episode 179. This allowed sophisticated attackers to generate a fake SSL server certificate for a rogue web site and appear to be the real thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.
What you are talking about is called a collision. Here is an article about SHA1 collisions:
http://www.rsa.com/rsalabs/node.asp?id=2927
Edit: Another answerer beat me to mentioning the pigeonhole principle, but to clarify, this is why it's called the pigeonhole principle: if you have some holes cut out for carrier pigeons to nest in, but more pigeons than holes, then some of the pigeons (input values) must share a hole (an output value).
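The pigeonhole effect is easy to watch on a deliberately truncated hash: with SHA-1 cut down to 16 bits of output, a birthday search finds a collision after roughly 2^8 tries (a sketch; the truncation and names are mine):

```python
import hashlib
from itertools import count

def truncated_sha1(data, bits=16):
    """First `bits` bits of SHA-1: a deliberately tiny 'hash' so the
    pigeonhole effect shows up in milliseconds."""
    h = int.from_bytes(hashlib.sha1(data).digest(), 'big')
    return h >> (160 - bits)

def find_collision(bits=16):
    """Birthday search: remember every digest until one repeats."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        d = truncated_sha1(msg, bits)
        if d in seen:
            return seen[d], msg     # two distinct inputs, same output
        seen[d] = msg

m1, m2 = find_collision()
assert m1 != m2
assert truncated_sha1(m1) == truncated_sha1(m2)
```

The same search against the full 160-bit output would need on the order of 2^80 memory and work, which is exactly why full-size collisions stayed theoretical for so long.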