Determining the key of a Vigenere Cipher if key length is known - cryptography

I'm struggling to get my head around the Vigenere Cipher when you know the length of the key but not what it is. I can decipher text if I know the key but I'm confused as to how to work out what the key actually is.
For one example, I'm given ciphertext and a key length of 6. That's all I'm given; I'm told the key is an arbitrary set of letters that don't necessarily have to make up a word in the English language, in other words, a random set of letters.
With this knowledge, all I've done so far is break the ciphertext up into 6 subtexts, each containing the letters encrypted by one key letter: the first subtext contains every 6th letter starting with the first, the second every 6th letter starting with the second, and so on.
What do I do now?

You calculate a letter frequency table for each letter of the key. If, as in your example, the key length is 6, you get 6 frequency tables. You should get similar frequency distributions, although not for the same letters. If you do not, then you have the wrong key length.
Now you compare against letter frequency tables for English (for example, see http://en.wikipedia.org/wiki/Letter_frequency). If the pattern does not match, the clear text was not in English. If it does, assign the most frequent letter in each subtext to the most frequent letter in the reference table, and so on, and see what you get. Note that your text may have slightly different frequencies; the reference tables are statistics based on a large amount of data. Now you need to use your head.
Using common digrams (such as th and sh in English) can help.
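For illustration, here is a minimal sketch of that splitting-and-counting step in Python (the function name and the sample ciphertext are placeholders, not taken from the answer):

    from collections import Counter

    def position_frequencies(ciphertext, key_length):
        # Keep only letters, uppercased, then split into one subtext
        # per key position (every key_length-th letter).
        text = [c for c in ciphertext.upper() if c.isalpha()]
        for i in range(key_length):
            subtext = text[i::key_length]
            total = len(subtext)
            top = ", ".join(f"{c}: {n / total:.1%}"
                            for c, n in Counter(subtext).most_common(5))
            print(f"position {i}: {top}")

    position_frequencies("LXFOPVEFRNHR", 6)  # replace with your ciphertext

Each printed line is the frequency table for one key letter; with enough text, each table should look like a shifted copy of the English distribution.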

One approach is frequency analysis. Take each of the six groups and build a frequency table of its characters. Then compare that table to a table of known frequencies for the plaintext's language (if it's standard text, this would just be English).
A second, possibly simpler, approach is to just brute-force each character. The number of possible keys is 26^6 ≈ 300,000,000, which is about 28 bits of key space. This is brute-forceable, but would probably take a while on a personal computer. Brute-forcing one character at a time, however, takes only 26*6 = 156 tries. To do so, write a function that scores an attempted decryption by how "plaintext-like" it looks. You might do frequency analysis as above, but there are simpler tests too. Then brute-force each of the six sets of characters and pick the key letter that scores best for each one.
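A minimal sketch of that per-position brute force, assuming a classic A=0 Vigenère and approximate English letter frequencies (the names and the frequency table are mine, not the answer's):

    # Approximate English letter frequencies, used for chi-squared scoring.
    ENGLISH_FREQ = {
        'A': .082, 'B': .015, 'C': .028, 'D': .043, 'E': .127, 'F': .022,
        'G': .020, 'H': .061, 'I': .070, 'J': .002, 'K': .008, 'L': .040,
        'M': .024, 'N': .067, 'O': .075, 'P': .019, 'Q': .001, 'R': .060,
        'S': .063, 'T': .091, 'U': .028, 'V': .010, 'W': .024, 'X': .002,
        'Y': .020, 'Z': .001,
    }

    def chi_squared(text):
        # Lower score = letter distribution closer to English.
        n = len(text)
        return sum((text.count(l) - f * n) ** 2 / (f * n)
                   for l, f in ENGLISH_FREQ.items())

    def best_shift(subtext):
        # Try all 26 shifts for this key position; keep the best-scoring one.
        scored = []
        for shift in range(26):
            plain = "".join(chr((ord(c) - 65 - shift) % 26 + 65) for c in subtext)
            scored.append((chi_squared(plain), shift))
        return min(scored)[1]

    def recover_key(ciphertext, key_length):
        text = [c for c in ciphertext.upper() if c.isalpha()]
        return "".join(chr(65 + best_shift("".join(text[i::key_length])))
                       for i in range(key_length))

Each of the six positions needs only 26 tries, matching the 156-try estimate above.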

Related

Bitcoin mnemonic words/keys. Can it be shorter than 12 by increasing the set of words? Good idea or not?

A Bitcoin mnemonic private key consists of 12 words. These words represent the private key.
I was wondering whether it is possible to make it shorter than 12 by increasing the set of words from which to choose. It would be easier to remember. Let's say we have a set of 10,000 words. How many words would then be enough to represent one private key? Does somebody know the exact calculation? Or any suggestion as to why this is not a good idea?
Thank you.
The seed phrase can contain any number of words that you want it to. Their function is simply to allow recreation of the private key for the Bitcoin wallet.
Regarding the length of 12 words, taken from here:
The English-language wordlist for the BIP39 standard has 2048 words, so if the phrase contained only 12 random words, the number of possible combinations would be 2048^12 = 2^132 and the phrase would have 132 bits of security. However, some of the data in a BIP39 phrase is not random, so the actual security of a 12-word BIP39 seed phrase is only 128 bits. This is approximately the same strength as all Bitcoin private keys, so most experts consider it to be sufficiently secure.
So although 12 would seem the correct amount, the important thing about the words is that they must be securely stored: preferably in memory, but at least written down and locked somewhere safe, and never stored in plaintext online or on your PC/devices.
One afterthought - by increasing the number of words that are used in the phrase, you could actually make it less secure, because the longer it is, the more likely it is that it must be recorded somewhere physical rather than in your own memory.
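To put a number on the original question, a back-of-the-envelope sketch (assuming the words are drawn uniformly at random and the 128-bit target quoted above):

    import math

    target_bits = 128
    wordlist_size = 10_000
    bits_per_word = math.log2(wordlist_size)        # ~13.29 bits per word
    print(math.ceil(target_bits / bits_per_word))   # 10 words

So a 10,000-word list would still need about 10 words to carry 128 bits, saving only two words compared to the 12-word BIP39 phrase.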

Map words to numbers

I am doing indexing of data in my IRE (Information Retrieval and Extraction) course. Instead of storing terms in the index, I am storing a termID, which is a mapping corresponding to the term. If the length of a term is 15, it takes 15 bytes, i.e. 120 bits, whereas if I use a termID instead I can definitely store it in fewer than 120 bits. One possible way is to maintain a dictionary of (term, termID) pairs, where termID runs from 1..n and n is the number of terms. The problems with this method are:
I have to keep this dictionary in RAM, and the dictionary size can be in GBs.
Finding the termID corresponding to a term takes O(log(n)), where n is the number of terms in the dictionary.
Can I make some function which takes a term as input and returns the mapping (an encoding) in O(1)? It is okay if there are a few collisions (just guessing that a few collisions in exchange for speed and memory is a good trade-off; BTW, I don't know how much it will affect my search results).
Is there any other better way to do this?
I think you more or less gave the answer already by saying "it is okay if there are a few collisions". The trick is hashing. You can first reduce the number of "characters" in your search terms: e.g., drop digits and special characters. Afterwards you can merge upper- and lower-case characters. Finally, you could apply some simple replacements, e.g. replacing the German ü by ue (which is actually its origin). After doing so you probably have something like 32 distinct characters, i.e. about 5 bits per character, so roughly six characters fit into 4 bytes. If you reserve 4 bytes for each word, you still need to deal with longer words; there you can basically resort to XOR-ing the 4-byte blocks together.
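A rough sketch of that normalize-then-fold idea in Python (the function name and the replacement table are illustrative assumptions, not part of the answer):

    def term_to_id(term: str) -> int:
        # Normalize: lowercase, map umlauts back to their digraph origin,
        # then drop everything outside a-z.
        term = term.lower()
        for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
            term = term.replace(src, dst)
        term = "".join(c for c in term if "a" <= c <= "z")
        # Fold the bytes into a fixed 4-byte ID by XOR-ing 4-byte blocks.
        term_id = 0
        data = term.encode("ascii")
        for i in range(0, len(data), 4):
            term_id ^= int.from_bytes(data[i:i + 4].ljust(4, b"\0"), "big")
        return term_id

    print(term_to_id("Grün") == term_to_id("gruen"))  # True

Note that such a plain XOR fold collides easily (for instance, swapping whole 4-byte blocks of a long word gives the same ID), so a real hash function truncated to 4 bytes would usually be the safer choice.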
An alternative approach would be something hybrid for the dictionary. If you build a dictionary for only the 10k most frequent words, you are most likely already covering most of the texts. Hence you only need to keep part of your dictionary in memory, while for the remaining words you can use a dictionary on hard disk, or maybe even ignore them.

Is encrypting low variance values risky?

For example, a credit card expiry month can only be one of twelve values. So a hacker would have a one in twelve chance of guessing the correct encrypted value of a month. If they knew this, would they be able to crack the encryption more quickly?
If this is the case, how many variations of a value are required to avoid this? How about a bank card security code, which is commonly only three digits?
If you use a proper cipher like AES in a proper way, then encrypting such values is completely safe.
This is because modes of operation that are considered secure (such as CBC and CTR) take an additional parameter called the initialization vector, which effectively randomizes the ciphertext even if the same plain text is encrypted multiple times.
Note that it's extremely important that the IV is used correctly. Every call of the encryption function must use a different IV. For CBC mode, the IV has to be unpredictable and preferably random, while CTR requires a unique IV (a random IV is usually not a bad choice for CTR either).
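As a small demonstration, assuming Python with the third-party cryptography package, encrypting the same low-variance plaintext twice under AES-CBC with fresh random IVs gives unrelated ciphertexts:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(32)                 # AES-256 key
    plaintext = b"expiry month: 07"      # 16 bytes, a low-variance value

    def encrypt(pt: bytes) -> bytes:
        iv = os.urandom(16)              # fresh random IV on every call
        enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        return iv + enc.update(pt) + enc.finalize()

    print(encrypt(plaintext).hex())
    print(encrypt(plaintext).hex())      # different, despite identical input

(The plaintext here is exactly one 16-byte block, so padding is omitted for brevity; real code would pad arbitrary-length input.)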
Good encryption means that if the attacker knows, as you mentioned, that the expiration month of a credit card is one of twelve values, then this limits the number of options by exactly that factor, and no more.
i.e.
If a hacker needs to guess three numbers, a, b, c, each of them can have values from 1 to 3.
The number of options will be 3*3*3 = 27.
Now the hacker finds out that the first number, a, is always the fixed value 2.
So the number of options is 1*3*3 = 9.
If revealing the value of the number a limits the number of options to a value less than 9, then you have been cracked; but in a strong model, if one of the numbers is revealed, the number of options will be limited to exactly 9.
Now you are obviously not using only the expiry date for encryption, I guess.
I hope I was clear enough.

Formula for checking the probability of a character appearing multiple times consecutively in an encrypted string

My question today is fairly specific and not so much about programming, more about statistics.
I asked myself whether there is a formula for how likely a character is to appear multiple times in a row. I made the assumption that every printable keyboard character (95 of them) is equally likely to appear, so that the formula would be something like:
(1/95)^n for one specific character, or 95 * (1/95)^n = (1/95)^(n-1) if you are not making any assumption about which character it is and are happy with just any.
I am sorry for the eye-hurting formatting, but I did not know how to format it more clearly
Now that is kind of nice as a formula, but it is based on too many assumptions and I am sure somebody has made more of that than an educated guess. Could you point me to a paper, a person or just the formula?
EDIT: This may be different for different encryption algorithms. Up until now I have not delved into the realm of statistics in cryptography. If someone could provide a paper on that (specifically character appearance probability), that would be nice as well.
Ideally, a cipher should produce ciphertext that is indistinguishable from random data. In fact, any cipher that does not fill this criterion is fundamentally weak.
In random data, each byte value is equally likely. An 8-bit byte can have 256 different values, so the probability of n consecutive bytes with the same value is (1/256)^(n-1).
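You can check this empirically on random bytes; a quick Monte Carlo sketch (the run length and sample size are arbitrary choices):

    import os

    n = 2                                # run length to test
    data = os.urandom(1_000_000)
    positions = len(data) - n + 1
    runs = sum(all(data[i + j] == data[i] for j in range(1, n))
               for i in range(positions))
    print(f"observed {runs / positions:.6f}, expected {1 / 256 ** (n - 1):.6f}")

With a million bytes, the observed rate for n = 2 should land close to 1/256 ≈ 0.0039.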

Problem 98 - Project Euler

The problem is as follows:
By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36². What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96². We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).
What is the largest square number formed by any member of such a pair?
NOTE: All anagrams formed must be contained in the given text file.
I don't understand the mapping of CARE to 1296. How does that work? Or are all permutation mappings meant to be tried, i.e. all letters to 1-9?
All assignments of digits to letters are allowed. So C=1, A=2, R=3, E=4 would be a possible assignment ... except that 1234 is not a square, so that would be no good.
Maybe another example would help make it clear? If we assign A=6, E=5, T=2, then TEA = 256 = 16² and ATE = 625 = 25². So (TEA=256, ATE=625) is a square anagram word pair.
(Just because all assignments of digits to letters are allowed, does not mean that actually trying out all such assignments is the best way to solve the problem. There may be some other, cleverer, way to do it.)
In short: yes, all permutations need to be tried.
If you test all substitutions of letters for digits, then you are looking for pairs of squares with these properties:
have the same length
have the same digits, with the same number of occurrences, as in the input string.
It is faster to find all these pairs of squares first. There are 68 squares of length 4, 217 squares of length 5, ... Filtering all squares of the same length by the above properties will generate a 'small' number of pairs, which are the solutions you are looking for.
This data is 'static' and doesn't depend on the input strings. It can be calculated once and used for all input strings.
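A sketch of that precomputation in Python (the helper names are mine): group the squares of a given length by their digit multiset, so each group directly yields the candidate pairs.

    from collections import defaultdict
    from itertools import combinations

    def squares_by_signature(length):
        # Map sorted-digit "signature" -> all squares of this length with it.
        by_sig = defaultdict(list)
        n = 1
        while len(str(n * n)) <= length:
            s = str(n * n)
            if len(s) == length:
                by_sig["".join(sorted(s))].append(s)
            n += 1
        return by_sig

    # Candidate pairs of 4-digit squares sharing the same digit counts.
    pairs = [p for group in squares_by_signature(4).values()
             for p in combinations(group, 2)]
    print(len(pairs), pairs[:3])

Each pair still has to be checked against an actual anagram word pair (the letter-to-digit mapping must be consistent), but the static square data is computed only once.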
Hmm. How to put this. The people who put together Project Euler promise that there is a solution that runs in under one minute for every problem, and there is only one problem that I think might fail that promise, but this is not it.
Yes, you could permute the digits and try all permutations against all squares, but that would be a very large search space, not at all likely to be the Right Thing (TM). In general, when you see that your "look" at the problem is going to generate a search that will take too long, you need to search something else.
Like, suppose you were asked to determine which numbers are the product of two primes between 1 and a zillion. You could factor every number between 1 and a zillion, but it would be much faster to take all combinations of two primes and multiply them. Since you are looking at combinations, you can start with 2, pair it with primes until the results are too large, then do the same with 3, and so on. By comparison, this is much faster - and you don't even have to multiply all the numbers out: you could take the logs of all the primes, add them instead, and find the limit for each prime, giving you a list of sums to work with.
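At toy scale, that "generate the products directly instead of factoring everything" idea looks like this (the limit of 100 is arbitrary):

    def primes_up_to(n):
        sieve = [True] * (n + 1)
        sieve[0] = sieve[1] = False
        for i in range(2, int(n ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i::i] = [False] * len(sieve[i * i::i])
        return [i for i, is_p in enumerate(sieve) if is_p]

    limit = 100
    ps = primes_up_to(limit)
    semiprimes = sorted({p * q for i, p in enumerate(ps)
                         for q in ps[i:] if p * q <= limit})
    print(semiprimes)  # 4, 6, 9, 10, 14, 15, 21, 22, ...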
There are a bunch of innovative solutions, but the first one you think of - especially the one you think of when Project Euler describes the problem - is likely to be wrong.
So, how can you approach this problem? There are probably too many permutations to look at, but maybe you can figure out something with mappings and comparing mappings?
(Trying to avoid giving it all away.)