Initialization vector - best practices (symmetric cryptography)

I would like to ask about best practices regarding the use of an initialization vector (IV) and a key for symmetric cryptography algorithms.
I want to accept messages from a client, encrypt them and store them in a backend. This will happen over time, and later there will be requests to pull the messages out and return them in readable form.
From what I know, the key can stay the same across the encryption of multiple separate messages, but the IV should change with every new encryption. This, however, causes a complication, because every message will then need its own IV for decryption at a later time.
I'd like to know if this is the best way of doing it. Is there any way to avoid storing an IV with every message, which would simplify the whole process of working with encryption/decryption?

IV selection is a bit complicated because the exact requirements depend on the mode of operation. There are some general rules, however:
You can't go wrong¹ with a random IV, except when using shorter IVs in modes that allow this.
Never use the same IV with the same key.
If you only ever encrypt a single message with a given key, the choice of IV doesn't matter².
Choose the IV independently of the data to encrypt.
Never use ECB.
Of the most common specific modes of operation:
CBC requires the IV to be generated uniformly at random. Do not use a counter as IV for CBC. Furthermore, if you're encrypting some data that contains parts that you receive from a third party, don't reveal the IV until you've fully received the data.
CTR uses the IV as the initial value of a counter which is incremented for every block, not for every message, and the counter value needs to be unique for every block. A block is 16 bytes for modern block ciphers such as AES (regardless of the key size). So for CTR, if you encrypt a 3-block message (33 to 48 bytes) with 0 as the IV, the next message must start with IV=3 (or larger), not IV=1; see the sketch after this list.
Modern modes such as Chacha20, GCM, CCM, SIV, etc. use a nonce as their IV. When a mode is described as using a nonce rather than an IV, this means that the only requirement is that the IV is never reused with the same key. It doesn't have to be random.
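To make the CTR counter rule concrete, here is a minimal sketch assuming Python's cryptography package (any AES-CTR implementation behaves the same way): the first message occupies counter blocks 0-2, so the next message must start its counter at 3 to avoid ever reusing a counter block under the same key.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)  # AES-256 key

def encrypt_ctr(key: bytes, counter_start: int, plaintext: bytes) -> bytes:
    # The 16-byte initial counter block is the big-endian encoding of counter_start.
    counter_block = counter_start.to_bytes(16, "big")
    enc = Cipher(algorithms.AES(key), modes.CTR(counter_block)).encryptor()
    return enc.update(plaintext) + enc.finalize()

msg1 = b"A" * 40                       # 40 bytes -> uses counter blocks 0, 1, 2
c1 = encrypt_ctr(key, 0, msg1)

next_counter = (len(msg1) + 15) // 16  # 3: the first counter block not consumed by msg1
msg2 = b"B" * 20                       # uses counter blocks 3 and 4
c2 = encrypt_ctr(key, next_counter, msg2)
```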
When encrypting data in a database, it is in general not safe to use the row ID (or a value derived from it) as IV. Using the row ID is safe only if the row is never updated or removed, because otherwise the second time data is stored using the same ID, it would repeat the IV. An adversary who sees two different messages encrypted with the same key and IV may well be able to decrypt both messages (the details depend on the mode and on how much the attacker can guess about the message content; note that even weak guesses such as “it's printable UTF-8” may suffice).
Unless you have a very good reason to do otherwise (just saving a few bytes per row does not count as a very good reason) and a cryptographer has reviewed the specific way in which you are storing and retrieving the data:
Use an authenticated encryption mode such as GCM, CCM, SIV or Chacha20+Poly1305.
If you can store a counter somewhere and make sure that it's never reset as long as you keep using the same encryption key, then each time you encrypt a message:
Increment the counter.
Use the new value of the counter as the nonce for the authenticated encryption.
The reason to increment the counter first is that if the process is interrupted, it will lead to a skipped counter value, which is not a problem. If step 2 was done without step 1, it would lead to repeating a nonce, which is bad. With this scheme, you can shave a few bytes off the nonce length if the mode allows it, as long as the length is large enough for the number of messages that you'll ever encrypt.
If you don't have such a counter, then use the maximum nonce length and generate a random nonce. The reason to use the maximum nonce length is that due to the birthday paradox, a random n-bit nonce is expected to repeat when the number of messages approaches 2^(n/2).
In either case, you need to store the nonce in the row.
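As a rough sketch of the recommendation above, assuming Python's cryptography package and AES-GCM (the helper names are made up for illustration), each row stores the nonce next to the ciphertext, and the nonce comes either from a persistent counter that is incremented before use or from a CSPRNG:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)

def encrypt_with_counter(counter: int, message: bytes):
    counter += 1                          # increment first, then use the new value
    nonce = counter.to_bytes(12, "big")   # 96-bit nonce, the standard size for GCM
    return counter, nonce, aead.encrypt(nonce, message, None)

def encrypt_with_random_nonce(message: bytes):
    nonce = os.urandom(12)                # random 96-bit nonce from a CSPRNG
    return nonce, aead.encrypt(nonce, message, None)

# Store (nonce, ciphertext) in the row; decryption needs both.
nonce, ct = encrypt_with_random_nonce(b"hello")
print(aead.decrypt(nonce, ct, None))
```

Whichever variant you pick, the persisted counter (or the random source) must survive restarts, and the nonce column travels with the ciphertext.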
¹ Assuming that everything is implemented correctly, e.g. random values need to be generated with a random generator that is appropriate for cryptography.
² As long as it isn't chosen in a way that depends on the key.

Related

Are AES keys just random bytes of a specific length or is there some sort of extra checks?

I want to scale up a simple website, and I just need simple encryption with the key held in an environment variable rather than setting up a Redis instance to hold it.
I'm looking at Converting Secret Key into a String and Vice Versa to do the retrieval.
I know I can export the key as a string, but I was wondering whether any arbitrary bytes can be used as long as they meet the length requirement.
An AES key is a sequence of 16, 24, or 32 bytes chosen by a cryptographically secure random number generator. There are no checks other than the length.
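To illustrate (a minimal sketch in Python; the environment variable name APP_AES_KEY is made up), a key is just random bytes from a CSPRNG, and turning it into a string for storage is plain encoding with no validation beyond the length:

```python
import base64
import os
import secrets

key = secrets.token_bytes(32)                    # AES-256 key: 32 random bytes, nothing more
key_str = base64.b64encode(key).decode("ascii")  # string form, e.g. for an environment variable

# Later: read it back, decode, and check only the length.
restored = base64.b64decode(os.environ.get("APP_AES_KEY", key_str))
assert len(restored) in (16, 24, 32)
```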

Can fragments of encrypted data be decrypted?

Assume I encrypt byte array A, resulting in encrypted byte array E. I next split E into two smaller arrays, E1 and E2.
If I ONLY have either E1 or E2, can they be decrypted, or do both E1 AND E2 need to be present, in the correct sequence, in order for the data to be successfully decrypted? Can any useful information (i.e. some subset of A) be extracted from either E1 or E2 independently?
I realize this may depend on the encryption algorithm. I am primarily curious about common key pair algorithms such as RSA.
It depends mostly on the mode used by the encryption, though it depends on the algorithm as to what modes are possible.
Using a streaming mode (CTR or CFB or OFB) the answer is generally 'yes' -- you can decrypt whatever part of the stream you receive, though with feedback modes, some may be lost.
Using a block mode (ECB or CBC) the answer is 'somewhat' -- you can decode any complete blocks you get, but any partial blocks will be unrecoverable.
Using a cross-block mode (not a standard term), the answer will be 'no', as these modes are designed specifically for that property.
RSA has a large block size and is generally used in block mode to encrypt a single block (which contains a symmetric session key), so having only part of the one block will generally mean you won't get anything.
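As an illustrative sketch of the block-mode case above, assuming Python's cryptography package and AES-CBC: decrypting only the tail fragment E2 recovers every complete block except the first one, which would need the ciphertext block that precedes the fragment.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(32), os.urandom(16)
plaintext = b"0123456789abcdef" * 4      # four 16-byte blocks, so no padding is needed

enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
e = enc.update(plaintext) + enc.finalize()

e2 = e[32:]                              # keep only the last two ciphertext blocks

# Each CBC plaintext block is D(C_i) XOR C_{i-1}. Lacking the block before E2,
# we feed a dummy IV: the fragment's first block comes out garbled, the rest is fine.
dec = Cipher(algorithms.AES(key), modes.CBC(b"\x00" * 16)).decryptor()
recovered = dec.update(e2) + dec.finalize()
print(recovered[16:] == plaintext[48:])  # True: the final block is fully recovered
```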

Crack sha256 when you know the pass form

Is it possible to write code that can crack a sha256 hash when you know the form of the password? For example, the password form is *-********** which is 12-13 characters long and:
The first char is one number from 1 to 25
Second one is hyphen
In each char from the third one to the end, you can put a...z, A...Z and 0...9
After guessing each password, the code converts it to a sha256 hash and checks whether the resulting hash equals our target hash, then prints the correct password.
I know the number of possibilities is huge, (26+26+10)^10, but I want to know:
Is it possible to write such code?
If yes, is it possible to run the whole search in less than one day (because I think it would take a lot of time to complete)?
Since I can't ask you to write a code for me, how and where can I ask for this code?
You cannot "crack" a SHA256 hash no matter how much information you know about the plaintext (assuming by crack you mean derive the plaintext from the hash). Even if you knew the password you could not determine any procedure for reversing the hash. In technical terms, there is no known way to perform a preimage attack on a SHA256 hash.
That means you have to resort to guessing or brute forcing the password:
You have a prefix, which can be any value in [1-25]- and 10 additional characters in [a-zA-Z0-9]. That means the total number of possible passwords is: 25 * 62^10 or 20,982,484,146,708,505,600.
If you were able to compute and check a billion passwords per second it would take you 20,982,484,146 seconds to generate every possible hash. If you start now you'll be finished in about 665 years.
If you are able to leverage some more computing power and generate a trillion hashes per second it would only take a bit more than half a year. The good news is that computing hashes can be done in parallel, so it is easy to utilize multiple machines. The bad news is that kind of computing power isn't going to be cheap.
To answer your questions:
Is it possible to write such code? It is possible to write a program that will iterate over the entire range of possible passwords and check it against the hash(es) you want to determine the plaintext for.
If yes, is it possible to run the whole search in less than one day? Yes, if you can compute and check around 10^15 hashes per second.
How and where can I ask for this code? This is the least of your problems.
Fortunately, since bitcoin uses sha256, it is pretty easy to find rough numbers on the amount of computing power it takes to generate the number of hashes you need.
If the numbers in this article are correct a Raspberry Pi can generate 2*10^5 hashes per second. I believe the newer Raspberry Pis are more powerful than that so I'm going to double that to 4*10^5. You need to generate about 10^15 hashes per second to be done in less than a day.
You're going to need about 2,500,000,000 Raspberry Pis.
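As a sketch of what such a search would look like (standard-library Python only; the target hash below is a placeholder), the program simply enumerates the 25 * 62^10 candidates and compares hashes. Writing it is easy; finishing it on one machine is the hard part, as the numbers above show.

```python
import hashlib
from itertools import product
from string import ascii_letters, digits

TARGET = bytes.fromhex("00" * 32)   # placeholder for the hash you want to invert
CHARSET = ascii_letters + digits    # a-z, A-Z, 0-9 (62 characters)

def crack():
    for prefix in range(1, 26):     # first field: 1..25
        head = f"{prefix}-".encode()
        for tail in product(CHARSET, repeat=10):
            candidate = head + "".join(tail).encode()
            if hashlib.sha256(candidate).digest() == TARGET:
                return candidate.decode()
    return None                     # about 2.1 * 10^19 candidates in total
```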

Is encrypting low variance values risky?

For example, a credit card expiry month can be only one of twelve values. So a hacker would have a one in twelve chance of guessing the plaintext behind an encrypted month. If they knew this, would they be able to crack the encryption more quickly?
If this is the case, how many variations of a value are required to avoid this? How about a bank card number security code which is commonly only three digits?
If you use a proper cipher like AES in a proper way, then encrypting such values is completely safe.
This is because modes of operation that are considered secure (such as CBC and CTR) take an additional parameter called the initialization vector, which effectively randomizes the ciphertext even if the same plain text is encrypted multiple times.
Note that it's extremely important that the IV is used correctly. Every call of the encryption function must use a different IV. For CBC mode, the IV has to be unpredictable and preferably random, while CTR requires a unique IV (a random IV is usually not a bad choice for CTR either).
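A small sketch of that point, assuming Python's cryptography package and AES-GCM: encrypting the same twelve-valued month twice with fresh nonces produces unrelated-looking ciphertexts, so an attacker cannot build a twelve-entry lookup table of ciphertexts.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)

month = b"07"                        # low-entropy plaintext: one of only twelve values
n1, n2 = os.urandom(12), os.urandom(12)
c1 = aead.encrypt(n1, month, None)
c2 = aead.encrypt(n2, month, None)
print(c1 != c2)                      # True: fresh nonces randomize the ciphertext
```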
Good encryption means that if an attacker knows, as you mentioned, that the expiration month of a credit card is one of twelve values, then that knowledge limits the number of possibilities by exactly that factor, and no more.
i.e.
If a hacker needs to guess three numbers, a, b, c, each of them can have values from 1 to 3.
The number of options will be 3*3*3 = 27.
Now the hacker finds out that the first number, a, is always the fixed value 2.
So the number of options is 1*3*3 = 9.
If revealing the value of the number a limits the number of options to fewer than 9, then the scheme has been weakened; in a strong scheme, revealing one of the numbers limits the options to exactly 9.
Of course, I assume you are not encrypting only the expiration date.
I hope that was clear enough.

Is it possible to get identical SHA1 hash? [duplicate]

Given two different strings S1 and S2 (S1 != S2) is it possible that:
SHA1(S1) == SHA1(S2)
is True?
If yes - with what probability?
If not - why not?
Is there an upper bound on the length of an input string for which the probability of getting duplicates is 0? Or is the calculation of SHA1 (and hence the probability of duplicates) independent of the length of the string?
The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).
Example:
Resource ID: X123
Parent ID: P123
I don't want to expose the nature of my resource identifiers by allowing the client to see "X123-P123".
Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.
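For concreteness, what the question describes would look roughly like this minimal sketch (Python's hashlib; the helper name is made up):

```python
import hashlib

def opaque_id(resource_id: str, parent_id: str) -> str:
    # Derive a stable, non-reversible identifier from the internal IDs.
    return hashlib.sha1(f"{resource_id}-{parent_id}".encode()).hexdigest()

print(opaque_id("X123", "P123"))  # a 40-hex-character ID that can be exposed to the client
```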
What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input than it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.
Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.
(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)
Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.
As for your third question: for a function with an n-bit output, there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, but rather with a steep cutoff around m = n/2. This is the same analysis as in the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).
Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.
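To put numbers on the birthday-bound estimates above, the standard approximation p ≈ 1 - exp(-k(k-1)/2^(n+1)) for k inputs and an n-bit hash can be evaluated directly (a small sketch, not part of the original answer):

```python
import math

def collision_probability(k: float, n_bits: int) -> float:
    # Birthday approximation: chance of at least one collision among k
    # uniformly random n-bit outputs, i.e. 1 - exp(-k(k-1) / 2^(n+1)).
    return -math.expm1(-k * (k - 1) / 2.0 / 2.0 ** n_bits)

print(collision_probability(2.0 ** 40, 160))  # about 4e-25: negligible
print(collision_probability(2.0 ** 80, 160))  # about 0.39: collisions become likely
```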
Yes it is possible because of the pigeon hole principle.
Most hashes (sha1 included) have a fixed output length, while the input can be of arbitrary size. So if you try long enough, you can find a collision.
However, cryptographic hash functions (like the sha family, the md family, etc.) are designed to make finding such collisions hard. The best known attack takes about 2^63 attempts to find a collision, so in practice you will not stumble on one by accident.
git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.
A collision is almost always possible in a hashing function. SHA1, to date, has held up reasonably well against attempts to produce collisions on demand. The danger comes when collisions can be predicted: then it is not necessary to know the original hash input to generate the same hash output.
For example, attacks against MD5 were used against SSL server certificate signing last year, as discussed on the Security Now podcast, episode 179. This allowed sophisticated attackers to generate a fake SSL server certificate for a rogue web site and make it appear to be the real thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.
What you are talking about is called a collision. Here is an article about SHA1 collisions:
http://www.rsa.com/rsalabs/node.asp?id=2927
Edit: Another answerer beat me to mentioning the pigeonhole principle, but to clarify, this is why it's called that: if you have holes cut out for carrier pigeons to nest in, but more pigeons than holes, then some of the pigeons (input values) must share a hole (the output value).