Parallelizable hashing algorithm where size and order of sub-strings is irrelevant - properties

EDIT
Here is the problem I am trying to solve:
I have a string broken up into multiple parts. These parts are not of equal, or predictable length. Each part will have a hash value. When I concatenate parts I want to be able to use the hash values from each part to quickly get the hash value for the parts together. In addition the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.
Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.
I am willing to ensure an elevated collision rate, so long as it is not too elevated. I am also ok with a slightly slower algorithm as it is hardly noticeable on small strings, and done in parallel for large strings.
I am familiar with a few hashing algorithms, however I currently have a use-case for a hash algorithm with the property that the sum of two hashes is equal to a hash of the sum of the two items.
Requirements/givens
This algorithm will be hashing byte-strings with length of at least 1 byte
hash("ab") = hash('a') + hash('b')
Collisions between strings with the same characters in different order is ok
Generated hash should be an integer of native size (usually 32/64 bits)
String may contain any character from 0-256 (length is known, not \0 terminated)
The ascii alpha-numeric characters will be by far the most used
A disproportionate number of strings will be 1-8 ASCII characters
A very tiny percentage of the strings will actually contain bytes with values at or above 127
If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.
I am thinking the simplest way to achieve this is:
Any byte's hash should be its value, normalized to <128 (if >128 subtract 128)
To get the hash of a string you normalize each byte to <128 and add it to the key
Depending on key size I may need to limit how many characters are used to hash to avoid overflow

I don't see anything wrong with just adding each (unsigned) byte value to create a hash which is just the sum of all the characters. There is nothing wrong with having an overflow: even if you reach the 32/64 bit limit (and it would have to be a VERY/EXTREMELY long string to do this) the overflow into a negative number won't matter in 2's complement arithmetic. As this is a linear process it doesn't matter how you split your string.

Related

Map words to numbers

I am doing indexing of data in my IRE (Information Retrieval and Extraction) course. Now instead of storing terms in the index, I am storing termID which is a mapping corresponding to the term. The size of term, if the length of the term is 15, would be 15 bytes i.e. 120 bits while if I use termID instead of term then I can definitely store it in less than 120 bits. One of the possible ways is to maintain a dictionary of the (term, termID) where termID would be from 1..n where n is the number of terms. The problems with this method is:
I have to keep this dictionary in the ram and the dictionary size can be in GBs.
To find termID corresponding to a term, it will take O(log(n)) where n is the number of terms in the dictionary.
Can I make some function which takes a term as an input and returns the mapping (encryption) in O(1) ?. It is okay if there are few collisions (Just guessing that a few collisions in exchange of speed and memory is a good trade-off. BTW I don't know how much it will effect my search results).
Is there any other better way to do this?
I think you gave the answer already more or less by saying "it is ok if there are a few collisions". The trick is hashing. You can first reduce the number of "characters" in your search terms. E.g., drop numbers, and special characters. Afterwards you can merge Upper and lower-case characters. Finally you could apply some simple replacements e.g. replacing the german ΓΌ bei ue (which is actually there origin). After doing so you have probably sth. like 32bit. You can then represent an four character string in a single byte. If you reserve 4 bytes for each words you need to deal with the longer words. There you can basically resort to xor each 4byte block.
An alternative approach would be to do something hybrid for the dictionary. If you would build a dictionary for only the 10k most frequent words you are most likely covering already most of the texts. Hence, you only need to keep parts of your dictionary in memory, while for most of the words you can use dictionary on hardisc or maybe even ignore them.

Why in some cases are used only the first x chars of a md5 hash instead of using all of them?

For example commit list on GitHub shows only first 10, or this line from tornadoweb which uses only 5
return static_url_prefix + path + "?v=" + hashes[abs_path][:5]
Are only the first 5 chars enough to make sure that 2 different hashes for 2 different files won't collide?
LE: The example above from tornadoweb uses md5 hash for generating a query sting for static file caching.
In general, No.
In fact, even if a full MD5 hash were given, it wouldn't be enough to prevent malicious users from generating collisions---MD5 is broken. Even with a better hash function, five characters is not enough.
But sometimes you can get away with it.
I'm not sure exactly what the context of the specific example you provided is. However, to answer your more general question, if there aren't bad guys actively trying to cause collisions, than using part of the hash is probably okay. In particular, given 5 hex characters (20 bits), you won't expect collisions before around 2^(20/2) = 2^10 ~ one thousand values are hashed. This is a consequence of the the Birthday paradox.
The previous paragraph assumes the hash function is essentially random. This is not an assumption anyone trying to make a cryptographically secure system should make. But as long as no one is intentionally trying to create collisions, it's a reasonable heuristic.

Hash Function for 2D Barcode Data

I am writing a string of about 120 characters to a 2D barcode. Along with other text, the string contains a unique ticket number. I want to ensure that someone doesn't generate counterfeit tickets by reading the 2D barcode and generation their own barcoded tickets.
I would like to hash the string and append the hash value to what gets embedded in the barcode. That way I can compare the two on reading and see if the data had been tampered with. I have seen several hash function that return 64 bytes and up but the more characters you embed in a 2D barcode the bigger the barcode image becomes. I would like an algorithm that returns a fairly small value. It would also be nice if I could provide the function my own key. Collision is not that big of a deal. This isn't any kind of national security application.
Any suggestions?
Use any standard hash function. Take the 120-character string; append your own secret value; feed it into SHA-1 or MD5 or whatever hash function you have handy or feel like implementing; then just take the first however-many bits you want and use that as your value. (If you need ASCII characters, then I suggest that you take groups of 6 bits and use a base-64 encoding.)
If the hash you're using is any good (as, e.g., MD5 and SHA-1 are; MD5 shouldn't be used for serious cryptographic algorithms these days but it sounds like it's good enough for your needs) then any set of bits from it will be "good enough" in the sense that no other function producing that many bits will be much better.
(Warning: For serious cryptographic use, you should be a little more careful. Look at, e.g., http://en.wikipedia.org/wiki/HMAC for more information. From your description, I do not believe you need to worry about such things.)

Creating unique hash code (string) in SQL Server from a combination of two or more columns (of different data types)

I would like to create unique string columns (32 characters in length) from combination of columns with different data types in SQL Server 2005.
I have found out the solution elsewhere in StackOverflow
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld')), 3, 32)
The answer thread is here
With HASBYTES you can create SHA1 hashes, that have 20 bytes, and you can create MD5 hashes, 16 bytes. There are various combination algorithms that can produce arbitrary length material by repeated hash operations, like the PRF of TLS (see RFC 2246).
This should be enough to get you started. You need to define what '32 characters' mean, since hash functions produce bytes not characters. Also, you need to internalize that no algorithm can possibly produce hashes of fixed length w/o collisions (guaranteed 'unique'). Although at 32 bytes length (assuming that by 'characters' you mean bytes) the theoretical collision probability of 50% is at 4x1038 hashed elements (see birthday problem), that assumes a perfect distribution for your 32 bytes output hash function, which you're not going to achieve.

Is it possible to get identical SHA1 hash? [duplicate]

This question already has answers here:
Probability of SHA1 collisions
(3 answers)
Closed 6 years ago.
Given two different strings S1 and S2 (S1 != S2) is it possible that:
SHA1(S1) == SHA1(S2)
is True?
If yes - with what probability?
If not - why not?
Is there a upper bound on the length of a input string, for which the probability of getting duplicates is 0? OR is the calculation of SHA1 (hence probability of duplicates) independent of the length of the string?
The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).
Example:
Resource ID: X123
Parent ID: P123
I don't want to expose the nature of my resource identifies to allow client to see "X123-P123".
Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.
What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input that it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.
Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.
(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)
Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.
As for your third question: for a function with a n-bit output, then there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, rather with a steep cutoff around m=n/2. This is the same analysis than the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).
Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.
Yes it is possible because of the pigeon hole principle.
Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.
However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions. The best attack known takes 2^63 attempts to find a collision, so the chance is 2^(-63) which is 0 in practice.
git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.
A collision is almost always possible in a hashing function. SHA1, to date, has been pretty secure in generating unpredictable collisions. The danger is when collisions can be predicted, it's not necessary to know the original hash input to generate the same hash output.
For example, attacks against MD5 have been made against SSL server certificate signing last year, as exampled on the Security Now podcast episode 179. This allowed sophisticated attackers to generate a fake SSL server cert for a rogue web site and appear to be the reaol thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.
What you are talking about is called a collision. Here is an article about SHA1 collisions:
http://www.rsa.com/rsalabs/node.asp?id=2927
Edit: So another answerer beat me to mentioning the pigeon hole principle LOL, but to clarify this is why it's called the pigeon hole principle, because if you have some holes cut out for carrier pigeons to nest in, but you have more pigeons than holes, then some of the pigeons(an input value) must share a hole(the output value).