Locality sensitive hashing - what happens when a bucket is empty? - locality-sensitive-hash

Assume I've constructed an LSH database according to some set of hashes, and I'm now beginning to query the database to find approximate nearest neighbors.
Are there any guidelines to what happens when you compute the hash for a query point, and the corresponding bucket is empty? Similarly, say I want to find the 5 approximate nearest neighbors, and the bucket has only 4 other data points?

I believe getting too few points for a retrieval means that you have too many buckets for your training data. And that is application dependent of course. Take a look at LSH toolbox by Greg Shakhnarovich implementation and his README file. In this implementation, fewer hash functions (smaller k) means fuller buckets, and that in turn means slower LSH.

Related

Best practice to store List<T> in StackExchange.Redis

I am trying to find best practice(efficient) way of storing set of List objects against ReportingDate key.
List could be serailised as Xml/DataContract or ProtoBuf....
And given some of the data could be big (for that slice of key):
I was wondering if there is any of getting data from redis cache in IEnum/streamed fashion? Atm we using ProtoBuf.NET to have file based cache. And we retrieve data into mem in streamed fashion (we also have an option of selecting what props/fields we want in that T object as ProtoBuf allows us to do it)
Is there any way can force (after some inactivity) certain part of the data to be offloaded from mem and back into file if it is not being used. But load it up again if it is called
Tnx
It sounds like you want a sorted set - see https://redis.io/topics/data-types#sorted-sets. You would use the date as the value, perhaps in epoch time (since it needs to be a number). SE.Redis supports all the operations you would expect to get ranges of values (either positional ranges - the first 20 records, etc; or absolute ranges bases on the value - all items between two dates expressed in the same unit). Look at the methods starting " SortedSet...".
The value can be binary, so protobuf-net is fine (you would serialize the value for each date separately). Just pass a byte[] as the value. You need to handle serialization separately to the redis library.
As for swapping data out: no. Redis has date-based expiration, but doesn't have hot and cold storage. It is either there, or it isn't. You could use scheduled tasks to purge or move data based on date ranges, again using any of the Z* (redis) or SortedSet* (SE.Redis) methods.
For the complete list of Z* operations, see: https://redis.io/commands#sorted_set. They should all be available in SE.Redis.

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like this:
384:w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND:wemfOGxqCfOTPi0ND
They can be broken down into three sections (split the string on the colons):
Block size: 384 in the hash above
First signature: w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND
Second signature: wemfOGxqCfOTPi0ND
Block size refers to the block size for the first signature, and the block size for the second signature is twice that of the first signature (here: 384 x 2 = 768). Each file has one of these composite hashes, which means each file has two signatures with different block sizes.
The spamsum signatures can be compared only if their block sizes correspond. That is to say that the composite hash above can be compared to any other composite hash that contains a signature with a block size of 384 or 768. The similarity of signature strings for hashes with similar block size can be takes as a measure of similarity between the files represented by the hashes.
So if we have:
file1.blk2 = 768
file1.sig2 = wemfOGxqCfOTPi0ND
file2.blk1 = 768
file2.sig1 = LsmfOGxqCfOTPi0ND
We can get a sense of the degree of similarity of the two files by calculating some weighted edit distance (like Levenshtein distance) for the two signatures. Here the two files seem to be pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
One can also calculate a normalized similarity score between two hashes (see the details here).
I would like to find any two files that are more than 70% similar based on these hashes, and I have a strong preference for using the available software packages (or APIs/SDKs), although I am not afraid of coding my way through the problem.
I have tried breaking the hashes down and indexing them using Lucene (4.7.0), but the search seems to be slow and tedious. Here is an example of the Lucene queries I have tried (for each single signature -- twice per hash and using the case-sensitive KeywordAnalyzer):
(blk1:768 AND sig1:wemfOGxqCfOTPi0ND~0.7) OR (blk2:768 AND sig2:wemfOGxqCfOTPi0ND~0.7)
It seems that Lucene's incredibly fast Levenshtein automata does not accept edit distance limits above 2 (I need it to support up to 0.7 x 64 ≃ 19) and that its normal editing distance algorithm is not optimized for long search terms (the brute force method used does not cut off calculation once the distance limit is reached.) That said, it may be that my query is not optimized for what I want to do, so don't hesitate to correct me on that.
I am wondering whether I can accomplish what I need using any of the algorithms offered by Lucene, instead of directly calculating the editing distance. I have heard that BK-trees are the best way to index for such searches, but I don't know of the available implementations of the algorithm (Does Lucene use those at all?). I have also heard that a probable solution is to narrow down the search list using n-gram methods but I am not sure how that compares to editing distance calculation in terms of inclusiveness and speed (I am pretty sure Lucene supports that one). And by the way, is there a way to have Lucene run a term search in the parallel mode?
Given that I am using Lucene only to pre-match the hashes and that I calculate the real similarity score using the appropriate algorithm later, I just need a method that is at least as inclusive as Levenshtein distance used in similarity score calculation -- that is, I don't want the pre-matching method to exclude hashes that would be flagged as matches by the scoring algorithm.
Any help/theory/reference/code or clue to start with is appreciated.
This is not a definitive answer to the question, but I have tried a number of methods ever since. I am assuming the hashes are saved in a database, but the suggestions remain valid for in-memory data structures as well.
Save all signatures (2 per hash) along with their corresponding block sizes in a separate child table. Since only signatures of the same size can be compared with each other, you can filter the table by block size before starting to compare the signatures.
Reduce all the repetitive sequences of more than three characters to three characters ('bbbbb' -> 'bbb'). Spamsum's comparison algorithm does this automatically.
Spamsum uses a rolling window of 7 to compare signatures, and won't compare any two signatures that do not have a 7-character overlap after eliminating excessive repetitions. If you are using a database that support lists/arrays as fields, create a field with a list of all possible 7-character sequences extracted from each signature. Then create the fastest exact match index you have access to on this field. Before trying to find the distance of two signatures, first try to do exact matches over this field (any seven-gram in common?).
The last step I am experimenting with is to save signatures and their seven-grams as the two modes of a bipartite graph, projecting the graph into single mode (composed of hashes only), and then calculating Levenshtein distance only on adjacent nodes with similar block sizes.
The above steps do a good pre-matching and substantially reduce the number of signatures each signature has to be compared with. It is only after these that the the modified Levenshtein/Damreau distance has to be calculated.

How to group nearby latitude and longitude locations stored in SQL

Im trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is the example of the data from another website. http://www.cycleinjury.co.uk/map
I am currently using SQLite to ~100k store lat / lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident have taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure finding nearby points is quick. Should I use SQLite's implementation of an R-Tree for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How to take cycle population density into account? There is a far greater population density of cyclist in london then say Bristol, therefore there appears to be a greater number of backstops in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program ran over this area might not find any blackspots if relative density was not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires them to be at least minPts objects.
As for the differences in densities across, that is what OPTICS (Wikipedia) is supposed do solve. You may need to use a different way of extracting clusters though.
Well, ok, maybe not 100% - you maybe want to have single hotspots, not areas that are "density connected". When thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably use the OPTICS plot an scan for local minima of "at least 10 accidents".
Update: Thanks for the pointer to the data set. It's really interesting. So I did not filter it down to cyclists, but right now I'm using all 1.2 million records with coordinates. I've fed them into ELKI for analysis, because it's really fast, and it actually can use the geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I've enabled the R*-tree index with STR bulk loading, because that is supposed to help to get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). Runtime was around 11 Minutes, not too bad. The OPTICS plot of course would be 1.2 million pixels wide, so it's not really good for full visualization anymore. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manour House, Picadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If it appears in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm where you look for a fixed number of clusters. k-means requires a few complete scans of the data, and voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate on slightly smaller data sets, is hierarchical clustering -- you put the two closest things together, and then build the clusters. An index might be helpful here.
I recommend though that you peruse a site such as kdnuggets in order to see what software -- free and otherwise -- is available.

VB.NET Comparing files with Levenshtein algorithm

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations that I've found of the Levenshtein algorithm all dimension a matrix that's length 1 * length 2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using nm memory, but I've forgotten how it works.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset(200 million entries) in a flat file. Data is of the form - a 10 digit phone number followed by 5-6 binary fields.
Every week I will be getting a Delta files which will only contain changes to the data.
Problem : Given a list of items i need to figure out whether each item(which will be the 10 digit number) is present in the dataset.
The approach I have planned :
Will parse the dataset and put it a DB(To be done at the start of the
week) like MySQL or Postgres. The reason i want to have RDBMS in the
first step is I want to have full time series data.
Then generate some kind of Key Value store out of this database with
the latest valid data which supports operation to find out whether
each item is present in the dataset or not(Thinking some kind of a
NOSQL db, like Redis here optimised for search. Should have
persistence and be distributed). This datastructure will be read-only.
Query this key value store to find out whether each item is present
(if possible match a list of values all at once instead of matching
one item at a time). Want this to be blazing fast. Will be using this functionality as the back-end to a REST API
Sidenote: Language of my preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER which performs set intersection.
You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.