The best way to search millions of fuzzy hashes - lucene

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like this:
384:w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND:wemfOGxqCfOTPi0ND
They can be broken down into three sections (split the string on the colons):
Block size: 384 in the hash above
First signature: w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND
Second signature: wemfOGxqCfOTPi0ND
Block size refers to the block size for the first signature, and the block size for the second signature is twice that of the first signature (here: 384 x 2 = 768). Each file has one of these composite hashes, which means each file has two signatures with different block sizes.
The spamsum signatures can be compared only if their block sizes correspond. That is to say that the composite hash above can be compared to any other composite hash that contains a signature with a block size of 384 or 768. The similarity of signature strings for hashes with similar block size can be takes as a measure of similarity between the files represented by the hashes.
So if we have:
file1.blk2 = 768
file1.sig2 = wemfOGxqCfOTPi0ND
file2.blk1 = 768
file2.sig1 = LsmfOGxqCfOTPi0ND
We can get a sense of the degree of similarity of the two files by calculating some weighted edit distance (like Levenshtein distance) for the two signatures. Here the two files seem to be pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
One can also calculate a normalized similarity score between two hashes (see the details here).
I would like to find any two files that are more than 70% similar based on these hashes, and I have a strong preference for using the available software packages (or APIs/SDKs), although I am not afraid of coding my way through the problem.
I have tried breaking the hashes down and indexing them using Lucene (4.7.0), but the search seems to be slow and tedious. Here is an example of the Lucene queries I have tried (for each single signature -- twice per hash and using the case-sensitive KeywordAnalyzer):
(blk1:768 AND sig1:wemfOGxqCfOTPi0ND~0.7) OR (blk2:768 AND sig2:wemfOGxqCfOTPi0ND~0.7)
It seems that Lucene's incredibly fast Levenshtein automata does not accept edit distance limits above 2 (I need it to support up to 0.7 x 64 ≃ 19) and that its normal editing distance algorithm is not optimized for long search terms (the brute force method used does not cut off calculation once the distance limit is reached.) That said, it may be that my query is not optimized for what I want to do, so don't hesitate to correct me on that.
I am wondering whether I can accomplish what I need using any of the algorithms offered by Lucene, instead of directly calculating the editing distance. I have heard that BK-trees are the best way to index for such searches, but I don't know of the available implementations of the algorithm (Does Lucene use those at all?). I have also heard that a probable solution is to narrow down the search list using n-gram methods but I am not sure how that compares to editing distance calculation in terms of inclusiveness and speed (I am pretty sure Lucene supports that one). And by the way, is there a way to have Lucene run a term search in the parallel mode?
Given that I am using Lucene only to pre-match the hashes and that I calculate the real similarity score using the appropriate algorithm later, I just need a method that is at least as inclusive as Levenshtein distance used in similarity score calculation -- that is, I don't want the pre-matching method to exclude hashes that would be flagged as matches by the scoring algorithm.
Any help/theory/reference/code or clue to start with is appreciated.

This is not a definitive answer to the question, but I have tried a number of methods ever since. I am assuming the hashes are saved in a database, but the suggestions remain valid for in-memory data structures as well.
Save all signatures (2 per hash) along with their corresponding block sizes in a separate child table. Since only signatures of the same size can be compared with each other, you can filter the table by block size before starting to compare the signatures.
Reduce all the repetitive sequences of more than three characters to three characters ('bbbbb' -> 'bbb'). Spamsum's comparison algorithm does this automatically.
Spamsum uses a rolling window of 7 to compare signatures, and won't compare any two signatures that do not have a 7-character overlap after eliminating excessive repetitions. If you are using a database that support lists/arrays as fields, create a field with a list of all possible 7-character sequences extracted from each signature. Then create the fastest exact match index you have access to on this field. Before trying to find the distance of two signatures, first try to do exact matches over this field (any seven-gram in common?).
The last step I am experimenting with is to save signatures and their seven-grams as the two modes of a bipartite graph, projecting the graph into single mode (composed of hashes only), and then calculating Levenshtein distance only on adjacent nodes with similar block sizes.
The above steps do a good pre-matching and substantially reduce the number of signatures each signature has to be compared with. It is only after these that the the modified Levenshtein/Damreau distance has to be calculated.

Related

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjuction word 'and') is much smaller, essentially giving more weight to the more document specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: Your data, depending on the size of corpus, may have too many dimensions (the same number of dimensions as unique words), you could use stemming/lemmatization in order to mitigate this problem to some degree.
You may calculate similiarity between two TF-IDF vector using cosine similiarity for example.
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

is SHA-512 collision resistant?

According to the books that i have read, it says that S.H.A(Secure Hash Algorithm) is collision resistant.But if the input space is a 1024 bit number and the output space is a 512 bit message digest then shouldn't it be colliding for
(2^1024)/(2^512) times? As the range is lesser than the domain being mapped there should have been collisions. please explain where i am going wrong.
The chance for a collision does not depend on the input size. The chance to a 512-bit hash collision is 1.4×10^77, see Probability table
Maybe your book has also mentioned the definition of collision resistance? It does not mean that no collisions are created (which is clearly not the case), but that given a hash you are not able to create a message easily that produces this hash.
a hash function H is collision resistant if it is hard to find two
inputs that hash to the same output; that is, two inputs a and b such
that H(a) = H(b), and a ≠ b
From Wikipedia
As you describe: Since the input space (arbitrary size) is larger than the output space (e.g. 512bit for sha512), there always exist collisions.
"Collision resistant" means, it is adequately unlikely for a collision to be found.
Your confusion is answered when considering how large the output space "512 bits" really is:
2^512 (the number of possible configurations of a 512 bit array) is of the order 10^154.
For comparison: The number of atoms in the visible universe is somewhere in the range of 10^80.
A million is 10^6.
So a million of our 'visible universes' has 10^86 atoms.
A million times a million universes has 10^92 atoms.
If you could store a single 512 bit value on a single atom, how many universes would you need to have all possible 512 bit has values stored?
Starting with a specific 512bit number (and assuming the has function is not broken), the probability p to obtain a collision is assuming you can produce new hashes with a rate R and have the total time of t to do this is:
p = R*t/(2^(512/2))
(The exponent is halved, see "birthday attach". The expected search space for a success is to find a collision in n bits is n/2.)
Let's plugin in some example numbers:
The has rate of the bitcoin network is currently about R = 200*10^15 / s (200 million terrahashes per second).
Consider the situation that since the beginning of the universe the bitcoin network's current hashing capacity would have been available for the sole purpose of finding a collision for a specific hash value, i.e. for an available time of t=13.787*10^9 years,
then the probability that a collision would have been found by now is about 7 × 10^-41 %
Again, it is hard to appreciate how small this number is.
Edit: A similar question with a good answer is found here: https://crypto.stackexchange.com/questions/89558/are-sha-256-and-sha-512-collision-resistant

Similarity matching algorithm

I have products with different details in different attributes and I need to develop an algorithm to find the most similar to the one I'm trying to find.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weight, etc. Then I need to do a search where the most similar return first. For example, if everything matches but the color is only Black and White but not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach, for example, I could do is restrict each attribute (weight, color, size) a limited set of option, so I can build a binary representation. So I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: also if I could use different weights for each attribute, it would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.
You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.
What are you talking about very similar to BK-trees. BK-tree constructs search tree with some metric associated with keys of this tree. Most common use of this tree is string corrections with Levenshtein or Damerau-Levenshtein distances. This is not static data structure, so it supports future insertions of elements.
When you search exact element (or insert element), you need to look through nodes of this tree and go to links with weight equal to distance between key of this node and your element. if you want to find similar objects, you need to go to several nodes simultaneously that supports your wishes of constrains of distances. (Maybe it can be even done with A* to fast finding one most similar object).
Simple example of BK-tree (from the second link)
BOOK
/ \
/(1) \(4)
/ \
BOOKS CAKE
/ / \
/(2) /(1) \(2)
/ | |
BOO CAPE CART
Your metric should be Hamming distance (count of differences between bit representations of two objects).
BUT! is it good to compare two integers as count of different bits in their representation? With Hamming distance HD(10000, 00000) == HD(10000, 10001). I.e. difference between numbers 16 and 0, and 16 and 17 is equal. Is it really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/

VB.NET Comparing files with Levenshtein algorithm

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations that I've found of the Levenshtein algorithm all dimension a matrix that's length 1 * length 2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using nm memory, but I've forgotten how it works.

Is it possible to get identical SHA1 hash? [duplicate]

This question already has answers here:
Probability of SHA1 collisions
(3 answers)
Closed 6 years ago.
Given two different strings S1 and S2 (S1 != S2) is it possible that:
SHA1(S1) == SHA1(S2)
is True?
If yes - with what probability?
If not - why not?
Is there a upper bound on the length of a input string, for which the probability of getting duplicates is 0? OR is the calculation of SHA1 (hence probability of duplicates) independent of the length of the string?
The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).
Example:
Resource ID: X123
Parent ID: P123
I don't want to expose the nature of my resource identifies to allow client to see "X123-P123".
Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.
What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input that it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.
Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.
(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)
Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.
As for your third question: for a function with a n-bit output, then there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, rather with a steep cutoff around m=n/2. This is the same analysis than the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).
Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.
Yes it is possible because of the pigeon hole principle.
Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.
However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions. The best attack known takes 2^63 attempts to find a collision, so the chance is 2^(-63) which is 0 in practice.
git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.
A collision is almost always possible in a hashing function. SHA1, to date, has been pretty secure in generating unpredictable collisions. The danger is when collisions can be predicted, it's not necessary to know the original hash input to generate the same hash output.
For example, attacks against MD5 have been made against SSL server certificate signing last year, as exampled on the Security Now podcast episode 179. This allowed sophisticated attackers to generate a fake SSL server cert for a rogue web site and appear to be the reaol thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.
What you are talking about is called a collision. Here is an article about SHA1 collisions:
http://www.rsa.com/rsalabs/node.asp?id=2927
Edit: So another answerer beat me to mentioning the pigeon hole principle LOL, but to clarify this is why it's called the pigeon hole principle, because if you have some holes cut out for carrier pigeons to nest in, but you have more pigeons than holes, then some of the pigeons(an input value) must share a hole(the output value).