Will cracking SHA1 be easier if I know part of the input? - cryptography

Assume that I know 80% of a SHA1 input. Is cracking the remaining 20% from the SHA1 hash value easier than cracking the whole input? If so, by how much?
E.g.: I know the x's in the input: SHA1(xxxxxxxxyy) = hash value

Assume there are 10 bytes in the input. To crack the whole input, we'd have to try 2^(10*8) inputs. With 80% given, we only have to try 2^(2*8) inputs. That's 2^64 times fewer, or roughly 18 quintillion. If the input size goes up, the ratio gets even larger.
SHA1 is irreversible today with about 100 unknown bits (12 bytes) in the input. With only 20% of the input unknown, that means the input size would need to be about 500 bits to be secure, or about 62 bytes.
It actually matters whether the unknown part is at the beginning or the end. Each 32 bits of known data at the beginning reduces the number of operations needed by a bit more than you might expect, because some of the calculations can be done once and re-used across candidates.
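As a concrete illustration, here is a minimal brute-force sketch in Python, assuming the unknown 20% is a two-byte suffix and the target digest is known (crack_suffix is just a hypothetical helper name):

    import hashlib
    from itertools import product

    def crack_suffix(known_prefix, target_digest, unknown_len=2):
        """Try every possible value of the unknown suffix until the SHA-1 digest matches."""
        base = hashlib.sha1(known_prefix)        # feed the known 80% only once
        for candidate in product(range(256), repeat=unknown_len):
            h = base.copy()                      # reuse the object that already holds the prefix
            h.update(bytes(candidate))
            if h.digest() == target_digest:
                return bytes(candidate)
        return None

    # 8 known bytes, 2 unknown bytes -> at most 2**16 candidates instead of 2**80.
    secret = b"known123" + b"\x4a\x7f"
    print(crack_suffix(b"known123", hashlib.sha1(secret).digest()))   # b'J\x7f'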


Is there a CRC or cryptographic function for generating smaller unique results from unique inputs?

I have a manufacturer-unique ID of 128 bits that I cannot change, and its size is just too long for our purpose (2^128 possible values). This is on an embedded microcontroller.
One idea is to compute a (run-time) CRC32 or hash to narrow the ID down, but I am not sure about uniqueness; CRC32, for example, can only be unique across 2^32 values.
Or what kind of cryptographic function can I use to guarantee uniqueness of a 32-bit output based on a unique input?
Thanks for any clarifications.
If you know all these ID values in advance, then you can check them using a hash table. You can save space by storing only as many bits of each hash value as are necessary to tell them apart if they happen to land in the same bucket.
If not, then you're going to have a hard time, I'm afraid.
Let's assume these 128-bit IDs are produced as the output of a cryptographic hash function (e.g., MD5), so each ID resembles 128 bits chosen uniformly at random.
If you reduce these to 32-bit values, then the best you can hope to achieve is a set of 32-bit numbers where each bit is 0 or 1 with uniform probability. You could do this by calculating the CRC32 checksum, or by simply discarding 96 bits — it makes no difference.
32 bits is not enough to avoid collisions. The collision probability exceeds 1 in a million after just 93 inputs, and 1 in a thousand after 2,900 inputs. After 77,000 inputs, the collision probability reaches 50%.
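Those figures follow from the birthday bound. A minimal sketch using the standard approximation p = 1 - exp(-n*(n-1)/(2*N)), with N = 2^32, reproduces them:

    import math

    def collision_probability(n, bits=32):
        """Approximate probability of at least one collision among n uniform `bits`-bit values."""
        N = 2 ** bits
        return 1.0 - math.exp(-n * (n - 1) / (2 * N))

    for n in (93, 2_900, 77_000):
        print(n, collision_probability(n))
    # 93     -> ~1e-6
    # 2900   -> ~1e-3
    # 77000  -> ~0.50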
So instead, your only real options are to somehow reverse-engineer the ID values into something smaller, or implement some external means of replacing these IDs with sequential integers (e.g., using a hash table).
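If keeping a table (off-device, or in non-volatile storage the device can reach) is an option, the "replace IDs with sequential integers" idea is just a lookup. A minimal sketch, where IdCompressor is a hypothetical name of my own:

    class IdCompressor:
        """Assigns each distinct 128-bit ID a small sequential integer on first sight."""

        def __init__(self):
            self._table = {}          # 128-bit ID -> small int

        def to_small_id(self, manufacturer_id):
            # Reuse the existing mapping, or allocate the next sequential value.
            return self._table.setdefault(manufacturer_id, len(self._table))

    c = IdCompressor()
    print(c.to_small_id(0xDEADBEEF_00112233_44556677_8899AABB))  # 0
    print(c.to_small_id(0x0123456789ABCDEF_0123456789ABCDEF))    # 1
    print(c.to_small_id(0xDEADBEEF_00112233_44556677_8899AABB))  # 0 again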

What is the difference between “SHA-2” and “SHA-256”

I'm a bit confused about the difference between SHA-2 and SHA-256 and often hear them used interchangeably. I think SHA-2 is a "family" of hash algorithms and SHA-256 is a specific algorithm in that family. Can anyone clear up the confusion?
The SHA-2 family consists of multiple closely related hash functions. It is essentially a single algorithm in which a few minor parameters are different among the variants.
The initial spec only covered 224, 256, 384 and 512 bit variants.
The most significant difference between the variants is that some are 32 bit variants and some are 64 bit variants. In terms of performance this is the only difference that matters.
On a 32 bit CPU SHA-224 and SHA-256 will be a lot faster than the other variants because they are the only 32 bit variants in the SHA-2 family. Executing the 64 bit variants on a 32 bit CPU will be slow due to the added complexity of performing 64 bit operations on a 32 bit CPU.
On a 64 bit CPU SHA-224 and SHA-256 will be a little slower than the other variants. This is because due to only processing 32 bits at a time, they will have to perform more operations in order to make it through the same number of bytes. You do not get quite a doubling in speed from switching to a 64 bit variant because the 64 bit variants do have a larger number of rounds than the 32 bit variants.
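You can check this on your own hardware with a rough benchmark like the sketch below. Treat the numbers as indicative only: modern CPUs with dedicated SHA-256 instructions, and the particular OpenSSL build behind hashlib, can change which variant wins.

    import hashlib
    import timeit

    data = b"\x00" * (1 << 20)   # 1 MiB of input

    for name in ("sha256", "sha512"):
        h = getattr(hashlib, name)
        seconds = timeit.timeit(lambda: h(data).digest(), number=100)
        print(f"{name}: {100 / seconds:.1f} MiB/s")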
The internal state is 256 bits in size for the two 32 bit variants and 512 bits in size for all four 64 bit variants. So the number of possible sizes for the internal state is less than the number of possible sizes for the final output. Going from a large internal state to a smaller output can be good or bad depending on your point of view.
If you keep the output size fixed, it can in general be expected that increasing the size of the internal state will improve security. If you keep the size of the internal state fixed and decrease the size of the output, collisions become more likely, but length extension attacks become harder. Making the output size larger than the internal state would be pointless.
Because the 64 bit variants are both faster (on 64 bit CPUs) and likely to be more secure (due to the larger internal state), two new variants were introduced that use 64 bit words but shorter outputs. Those are the ones known as SHA-512/224 and SHA-512/256.
The reason for wanting variants with output that much shorter than the internal state is usually either that for some usages it is impractical to use such a long output, or that the output needs to be used as a key for some algorithm that takes an input of a certain size.
Simply truncating the final output to your desired length is also possible. For example, the HMAC construction allows truncating the final hash output to the desired MAC length. Because HMAC feeds the output of one invocation of the hash as input to another invocation, using a hash with a shorter output results in an HMAC with less internal state. For this reason it is likely to be slightly more secure to use HMAC-SHA-512 and truncate the output to 384 bits than to use HMAC-SHA-384.
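A small sketch of that last point (key and message are just placeholders):

    import hmac
    import hashlib

    key = b"secret key"
    message = b"message to authenticate"

    # HMAC-SHA-512 truncated to 384 bits (48 bytes)...
    tag_truncated = hmac.new(key, message, hashlib.sha512).digest()[:48]

    # ...versus plain HMAC-SHA-384.
    tag_sha384 = hmac.new(key, message, hashlib.sha384).digest()

    print(len(tag_truncated), len(tag_sha384))   # 48 48, but different tag values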
The final output of SHA-2 is simply the internal state (after processing the length-padded input) truncated to the desired number of output bits. The reason SHA-384 and SHA-512 of the same input look so different is that a different IV is specified for each of the variants.
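A quick way to see the different IVs in practice: SHA-384 is not just SHA-512 truncated to 384 bits, even though both share the same 512-bit internal state. A minimal sketch with Python's hashlib:

    import hashlib

    data = b"abc"
    sha512_truncated = hashlib.sha512(data).digest()[:48]
    sha384 = hashlib.sha384(data).digest()

    print(sha512_truncated.hex())
    print(sha384.hex())
    print(sha512_truncated == sha384)   # False: different IVs give unrelated outputs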
Wikipedia:
The SHA-2 family consists of six hash functions with digests (hash values) that are 224, 256, 384 or 512 bits: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256.

Encoding - Efficiently send sparse boolean array

I have a 256 x 256 boolean array. This array is constantly changing, and the set bits are practically randomly distributed.
I need to send a current list of the set bits to many clients as they request them.
The following numbers are approximations.
If I send the coordinates for each set bit:
set bits    data transfer (bytes)
       0    0
     100    200
     300    600
     500    1000
    1000    2000
If I send the distance (scanning from left to right) to the next set bit:
set bits    data transfer (bytes)
       0    0
     100    256
     300    300
     500    500
    1000    1000
The typical number of bits that are set in this sparse array is around 300-500, so the second solution is better.
Is there a way I can do better than this without much added processing overhead?
Since you say "practically randomly distributed", let's assume that each location is a Bernoulli trial with probability p. p is chosen to get the fill rate you expect. You can think of the length of a "run" (your option 2) as the number of Bernoulli trials necessary to get a success. It turns out this number of trials follows the Geometric distribution (with probability p). http://en.wikipedia.org/wiki/Geometric_distribution
What you've done so far in option #2 is to work out the maximum run length for each value of p and reserve enough bits to send any run of that length. Note that this maximum length is still only probabilistic, and the scheme will fail if you get really unlucky and all your set bits are clustered at the beginning and end.
As @Mike Dunlavey recommends in the comments, Huffman coding, or some other form of entropy coding, can redistribute the bits spent according to the frequency of each run length. That is, short runs are much more common, so use fewer bits to send those lengths. The theoretical limit for this encoding efficiency is the "entropy" of the distribution, which you can look up on that Wikipedia page and evaluate for different probabilities. In your case, this entropy ranges from 7.5 bits per run (for 1000 entries) to 10.8 bits per run (for 100).
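Those per-run figures can be reproduced by evaluating the entropy of the geometric distribution at the corresponding fill rates. A small sketch, assuming the 256 x 256 = 65,536-cell grid:

    import math

    def bits_per_run(set_bits, cells=256 * 256):
        """Entropy of the geometric run-length distribution, in bits per run."""
        p = set_bits / cells
        h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy per Bernoulli trial
        return h / p                                          # expected trials per run = 1/p

    for k in (100, 300, 500, 1000):
        print(k, round(bits_per_run(k), 2))
    # 100  -> ~10.8 bits per run
    # 300  -> ~9.2
    # 500  -> ~8.5
    # 1000 -> ~7.5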
Actually, this means you can't do much better than you're currently doing for the 1000-entry case: 8 bits = 1 byte per run. For the case of 100 entries, you're currently spending 20.5 bits per run instead of the theoretically possible 10.8, so that end has the highest chance for improvement. And in the case of 300, I think you haven't reserved enough bits to represent these sequences: the entropy comes out to about 9.2 bits per run, and you're currently sending 8. You will find many cases where the gap between set bits exceeds what 8 bits can represent, which will overflow your encoding.
All of this, of course, assumes that things really are random. If they're not, you need a different entropy calculation. You can always compute the entropy right out of your data with a histogram, and decide if it's worth pursuing a more complicated option.
Finally, also note that real-life entropy coders only approximate the entropy. Huffman coding, for example, has to assign an integer number of bits to each run length. Arithmetic coding can assign fractional bits.
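For reference, here is a sketch of the gap-based encoding from option #2. The 0xFF filler byte used to handle gaps that don't fit in a single byte is just one possible convention, not something taken from your description:

    def encode_gaps(bitmap):
        """Encode a flat boolean list as 8-bit gaps between set bits.

        Gaps of 255 or more are handled by emitting 0xFF filler bytes,
        each meaning 'skip 255 cells without a set bit'.
        """
        out = bytearray()
        gap = 0
        for bit in bitmap:
            if bit:
                while gap >= 255:          # escape for long gaps
                    out.append(0xFF)
                    gap -= 255
                out.append(gap)
                gap = 0
            else:
                gap += 1
        return bytes(out)

    # Example: 3 set bits, one of them far away from the previous one.
    bitmap = [False] * 65536
    for i in (5, 10, 700):
        bitmap[i] = True
    print(list(encode_gaps(bitmap)))   # [5, 4, 255, 255, 179]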

How would you most efficiently store latitude and longitude data?

This question comes from a homework assignment I was given. You can base your storage system on one of the three following formats:
DD MM SS.S
DD MM.MMM
DD.DDDDD
You want to maximize the amount of data you can store by using as few bytes as possible.
My solution is based off the first format. I used 3 bytes for latitude: 8 bits for the DD (-90 to 90), 6 bits for the MM (0-59), and 10 bits for the SS.S (0-59.9). I then used 25 bits for the longitude: 9 bits for the DDD (-180 to 180), 6 bits for the MM, and 10 for the SS.S. This solution doesn't fit nicely on a byte border, but I figured the next reading can be stored immediately following the previous one, and 8 readings would use only 49 bytes.
I'm curious what methods others can come up with. Is there a more efficient method of storing this data? As a note, I considered an offset-based storage, but the problem gave no indication of how much the values may change between readings, so I'm assuming any change is possible.
Your suggested method is not optimal. You are using 10 bits (1024 possible values) to store a value in the range (0..599). This is a waste of space.
If you use 3 bytes for latitude, you should map the range [0, 2^24-1] to the range [-90, 90]. Each of the 2^24 values then represents 180/2^24 degrees, which is about 0.039 seconds.
If you want only 0.1-second accuracy, you need 23 bits for latitudes and 24 bits for longitudes (you'll get 0.077-second accuracy). That's 47 bits total instead of your 49 bits, with better accuracy.
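A sketch of that fixed-point mapping, packing 23 bits of latitude and 24 bits of longitude into one 47-bit integer (the rounding convention here is my own assumption):

    LAT_BITS, LON_BITS = 23, 24

    def encode(lat, lon):
        """Pack (lat, lon) into a single 47-bit integer using linear fixed-point mapping."""
        lat_i = round((lat + 90.0) / 180.0 * ((1 << LAT_BITS) - 1))
        lon_i = round((lon + 180.0) / 360.0 * ((1 << LON_BITS) - 1))
        return (lat_i << LON_BITS) | lon_i

    def decode(packed):
        lat_i = packed >> LON_BITS
        lon_i = packed & ((1 << LON_BITS) - 1)
        lat = lat_i / ((1 << LAT_BITS) - 1) * 180.0 - 90.0
        lon = lon_i / ((1 << LON_BITS) - 1) * 360.0 - 180.0
        return lat, lon

    packed = encode(48.858222, 2.2945)   # roughly the Eiffel Tower
    print(decode(packed))                # round-trips to within one quantization step (~0.077 s)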
Can we do even better?
The exact number of bits needed per pair for 0.1-second accuracy is log2(180*60*60*10 * 360*60*60*10) < 46.256. This means you can use 46,256 bits (5,782 bytes) to store 1000 (lat, lon) pairs, but the mathematics involved will require dealing with very large integers.
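In a language with arbitrary-precision integers, the "very large integers" arithmetic is not too painful. A sketch that packs a whole list of (lat, lon) index pairs into one big number at about 46.256 bits per pair:

    LAT_STEPS = 180 * 60 * 60 * 10      # 6,480,000 distinct 0.1-second latitude values
    LON_STEPS = 360 * 60 * 60 * 10      # 12,960,000 distinct 0.1-second longitude values

    def pack(pairs):
        """Pack (lat_index, lon_index) pairs into one arbitrary-precision integer."""
        n = 0
        for lat_i, lon_i in pairs:
            n = (n * LAT_STEPS + lat_i) * LON_STEPS + lon_i   # mixed-radix positional encoding
        return n

    def unpack(n, count):
        pairs = []
        for _ in range(count):
            n, lon_i = divmod(n, LON_STEPS)
            n, lat_i = divmod(n, LAT_STEPS)
            pairs.append((lat_i, lon_i))
        return list(reversed(pairs))

    pairs = [(123456, 654321), (6_479_999, 12_959_999)]
    assert unpack(pack(pairs), len(pairs)) == pairs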
Can we do even better?
It depends. If your data set has concentrations, you can store only some reference points and relative distances from those points, using fewer bits. Clustering algorithms could be used to pick the reference points.
Sticking to existing technology:
If you used half-precision floating point numbers to store only the DD.DDDDD data, you could be a lot more space-efficient, but you'd have to accept the format's limits (1 sign bit, 5 exponent bits with a bias of 15, and a 10-bit significand), which means the stored coordinates might not be exact, only approximations of the original values.
This is due to the way floating point numbers are stored: a normalized significand is scaled by a power of two given by the exponent, instead of the value being stored directly (as with the integer encoding you used in your solution).
The next larger commonly used floating point format uses 32 bits (the type "float" in many programming languages) - still compact, but larger than your custom format.
If, however, you designed your own custom floating point type and gradually added more bits, your results would become more exact while still being more efficient than the solution you first found. Just play around with the number of bits used for the significand and exponent, and find out how close your floating-point approximations come to the desired result in degrees!
Well, if this is for a large number of readings, you might try a differential approach: start with an absolute location, and then save incremental changes, which should ideally require fewer bits, depending on the nature of the changes. This is effectively compressing the stream. But somehow I don't think that's what this homework is about.
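A sketch of the differential idea, assuming (purely for illustration) that readings are already quantized to integer tenths of an arcsecond and that consecutive readings are close enough for each per-axis delta to fit in a signed byte:

    def delta_encode(points):
        """points: (lat, lon) pairs in integer tenths of an arcsecond.
        The first point is stored absolutely, later points as per-axis deltas."""
        deltas = [points[0]]
        for (lat, lon), (plat, plon) in zip(points[1:], points):
            dlat, dlon = lat - plat, lon - plon
            # Assumption for this sketch: each delta fits in one signed byte per axis.
            assert -128 <= dlat <= 127 and -128 <= dlon <= 127
            deltas.append((dlat, dlon))
        return deltas

    def delta_decode(deltas):
        points = [deltas[0]]
        for dlat, dlon in deltas[1:]:
            plat, plon = points[-1]
            points.append((plat + dlat, plon + dlon))
        return points

    track = [(1758896, 826020), (1758901, 826013), (1758910, 826002)]
    assert delta_decode(delta_encode(track)) == track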

Binary string with random shift - cryptography

Hello,
I have a binary string of length n. My goal is for every bit in the string to equal "1".
I can flip any bit of the string I want, but after flipping a bit the string undergoes a random circular shift (the shift length is uniformly distributed between 0 and n-1).
I have no way to know the state of the bits, neither initially nor in the middle of the process; I only know when they are all "1".
As I understand it, there should be some strategy that guarantees I eventually pass through every possible state (the full truth table) of this string.
Thank you
Flip bit 1 until all are set to 1. I don't see there being anything faster without testing the bits.
Georg has the best answer: if the string is shifted randomly (I assume by 0..n-1 positions, uniformly distributed), his strategy of always flipping the first bit will sooner or later succeed.
Unfortunately, that strategy may take a very long time, depending on the length of the string.
On average, n/2 of the bits will be set to 1, so the probability that a flip turns a 0 into a 1 is about 0.5; each additional bit that is already set decreases that probability by 1/n.
The process can be viewed as a Markov chain in which the probability of reaching the state 0xff...ff, where all bits are set, is calculated, and from that the average number of trials required to reach that state can be derived.
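A quick Monte Carlo sketch of that chain (flip the first bit, apply a uniformly random circular shift, repeat until every bit is 1) gives a feel for the average number of flips needed for small n:

    import random

    def flips_until_all_ones(n, rng):
        bits = [rng.randint(0, 1) for _ in range(n)]   # unknown starting state
        flips = 0
        while not all(bits):
            bits[0] ^= 1                               # always flip the first bit
            flips += 1
            shift = rng.randrange(n)                   # random circular shift of 0..n-1
            bits = bits[shift:] + bits[:shift]
        return flips

    rng = random.Random(0)
    for n in (4, 6, 8):
        trials = [flips_until_all_ones(n, rng) for _ in range(2000)]
        print(n, sum(trials) / len(trials))            # average flips grows quickly with n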