VB.NET Comparing files with Levenshtein algorithm - vb.net

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 MB. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations of the Levenshtein algorithm that I've found all dimension a matrix that's length1 × length2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?

Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using n × m memory, but I've forgotten how it works.
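To illustrate, here is a minimal sketch of the two-row approach (written in Swift rather than VB.NET for brevity; the structure translates directly, and the byte arrays stand in for buffered reads from the two file streams):
// Two-row Levenshtein distance: memory is proportional to one input's length,
// not to the product of both lengths. For 250 MB inputs you may want UInt32
// cells instead of Int to halve the size of the two rows.
func levenshtein(_ a: [UInt8], _ b: [UInt8]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }

    var previous = Array(0...b.count)                       // row i - 1
    var current = [Int](repeating: 0, count: b.count + 1)   // row i

    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitution = (a[i - 1] == b[j - 1]) ? 0 : 1
            current[j] = min(previous[j] + 1,                  // deletion
                             current[j - 1] + 1,               // insertion
                             previous[j - 1] + substitution)   // substitution
        }
        swap(&previous, &current)   // the new row becomes the current row
    }
    return previous[b.count]        // after the final swap, the last row is here
}
Keep in mind that this only reduces memory; the running time is still proportional to the product of the two lengths, so comparing two 250 MB files will be very slow no matter how the rows are stored.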

Related

How can I recode 53k unique addresses (saved as objects) w/o One-Hot-Encoding in Pandas?

My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my Colab session with (allegedly) a TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of Data Science/Machine Learning that is an art, not a science. A couple ideas:
One hot encode everything, then use a dimensionality reduction algorithm to remove some of the columns (PCA, SVD, etc).
Only one-hot encode some values (say, limit it to 10 or 100 categories rather than 53,000), then for the rest, use an "other" category (see the sketch after this list).
If it's possible to construct an embedding for these variables (Not always possible), you can explore this.
Group/bin the values in the column by some underlying feature. E.g. if the feature is something like days_since_X, bin it by 100 or so. Or if it's names of animals, group them by type instead (mammal, reptile, etc.).
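As a rough sketch of the second idea above, keep only the most frequent categories and lump everything else into "other". It is shown in Swift purely for illustration; in pandas you could get the same effect with value_counts and a replacement mapping.
// Keep only the n most frequent categories; map everything else to "other".
func capCategories(_ values: [String], keepTop n: Int) -> [String] {
    var counts: [String: Int] = [:]
    for v in values { counts[v, default: 0] += 1 }
    let keep = Set(counts.sorted { $0.value > $1.value }.prefix(n).map { $0.key })
    return values.map { keep.contains($0) ? $0 : "other" }
}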

What is a use case of `SSWAP`?

In doing some stuff with BLAS operations I see the level 1 operation SSWAP.
I can't come up with a programming use case for this.
My thinking is, if you were passing y to a function but wanted it to have the values of x, why not simply pass x? Swapping the values seems rather convoluted.
This is just a question out of curiosity.
Sometimes swapping the contents of two (strided) vectors is exactly what you need. For instance, when doing row or column interchanges in pivoting during LU factorization -- the reference LAPACK uses xSWAP in xGBTRF. The pivoting algorithm for LU decomposition requires swapping the contents of two rows (or columns). These two rows (or columns) can be thought of as two vectors (possibly with a non-unit stride between the elements). One needs to do many such interchanges along the way, and they change as the algorithm proceeds, so there is no option to "just pass some other row to a function" at the end of the algorithm.
To sum up, as a basic building block of more complex algorithms, a (potentially) optimized routine for interchanging columns or rows of a matrix seems useful.
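As a minimal sketch of that building block (not BLAS itself): with a column-major matrix stored in one flat array, swapping two rows during partial pivoting is exactly an exchange of two strided vectors, which is what xSWAP provides.
// Column-major matrix with n columns stored in a flat array with leading dimension ld.
// Swapping rows r1 and r2 (as partial pivoting in LU factorization requires)
// is an exchange of two n-element vectors with stride ld between elements.
func swapRows(_ a: inout [Float], cols n: Int, ld: Int, _ r1: Int, _ r2: Int) {
    guard r1 != r2 else { return }
    for j in 0..<n {
        a.swapAt(r1 + j * ld, r2 + j * ld)
    }
}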

Storing DNA in Swift

I'm going to write an application for dealing with raw DNA data samples, such as the files you get from MyHeritage, Ancestry, FamilyTreeDNA, 23andMe, etc. Each of these files is basically a CSV file with some quirks, and I asked about decoding them in another question I posted earlier.
Now for the next part. When I have parsed/decoded those files, I want to put the DNA data in a database, so that I can compare one person's DNA to that of another person. It's a lot of data, but not more than most computers can handle.
In memory, I can have the full DNA for both persons, compare them, and then create ArraySlices for the segments of DNA data that overlap, but ArraySlices aren't suitable for storage. An ArraySlice can't exist by itself in memory; it's just a reference into the full array, so if I flattened the ArraySlice I would still get the whole array, even for the segments that don't match.
Each person will have their full DNA in backing store, from which it can be read into memory, but how would you store the matching segments?
I'm thinking of something like:
// Bucket is the term FamilyTreeDNA uses to indicate whether the DNA match is on the maternal or the paternal chromosome.
enum Bucket {
    case maternal
    case paternal
}
// The first 22 chromosome pairs are just numbered from 1 to 22, but the last two are X and Y, so I use a String to store the chromosome "number".
struct SharedSegment {
    let onChromosome: String
    let bucket: Bucket
    let range: Range<UInt>
}
I don't care if it takes more disk space, but I want lightning-fast comparisons of DNA, so that I can compare all the DNA for all the individuals in a database without it taking months. I also need storage space for the full DNA to make the comparisons.
At the first stage I'm just building an app for storing the DNA kits I administer, but I already have plans for a service along the lines of GEDmatch and DNAPainter, if you have tried those. That means a service where people can upload their DNA to be compared to other people's DNA. Let's say a million people upload their DNA to this service; each of them should have their DNA compared to the other 999,999 people. The number of comparisons will be huge, so my primary focus is on performance. Each file with raw DNA data will contain about 400-950 thousand lines of DNA data.
Each line will contain the chromosome number, the RSID, the position within the chromosome and a genotype. The latter is two letters, "AA", "AC", "CT" etc. There are four different letters: A, C, G and T. The reason there are two letters for each position is that you have chromosome pairs, where one chromosome is inherited from the father and one from the mother, and there is one letter from each of those two chromosomes. Of course I can store them as just a string of characters, but there are chances of errors, so I would like to represent them in code as
enum Aminoacid: UInt8 {
    case noCall = 0
    case A
    case C
    case G
    case T
}
When sequencing DNA there are sometimes problems, and the sequencing equipment can't determine which letter belongs in a certain position. This is called a "no call", hence the case noCall in the enum. In the raw DNA file this is represented by a dash, so the results can say "-A", which means that one of the parents had an A in that position and the other could not be determined.
Is there any possibility to squeeze them together into 4 bits (a nybble), so that I can store two of these letters per byte? It's even possible to squeeze them into 3 bits, but I can't fit three letters into a byte anyway; it would be two letters at 3 bits each with two bits wasted in every byte, so I could just as well use 4 bits per letter. There are UInt64, UInt32, UInt16 and UInt8 in Swift, but no UInt4, which would be ideal for this case.
I'm also wondering whether to store the two letters from the maternal and paternal chromosomes together, or to split them into separate arrays (one array for maternal DNA and one for paternal). There is a problem with that approach: it's impossible to tell whether the first letter on each row is maternal or paternal until you have the DNA of at least one of the parents to compare with. In the absence of their DNA, I would need a third array to store both letters in, until I can determine which one is maternal and which is paternal. I'm trying to come up with the most efficient way of storing this, to make the comparisons super fast.
In one way I don't like using enums, because I will have to convert them to rawValue so I can do something like
var genotype = Aminoacid.A.rawValue << 4 + Aminoacid.G.rawValue
As far as I can see that's the best way to squeeze two of these into one byte, since there's no UInt4.
I'm not so fond of having lots of .rawValue all over my code. I would like to write just Aminoacid.A << 4 + Aminoacid.G, but unfortunately I don't think that's possible. Maybe there is a better way to store these sequences in the database, like enums with associated values or something. I don't know how efficient associated values will be when working with such large data sets.
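For what it's worth, one way to keep the rawValue arithmetic in a single place is a small wrapper type. The names below are illustrative only, assuming Aminoacid has a UInt8 raw value as declared above:
// Packs two calls (e.g. the maternal and the paternal letter) into one byte:
// one in the high nybble, one in the low nybble.
struct PackedGenotype {
    let byte: UInt8

    init(_ first: Aminoacid, _ second: Aminoacid) {
        byte = (first.rawValue << 4) | second.rawValue
    }

    // Force-unwrapping is safe for bytes produced by the initializer above,
    // since each nybble is always a valid raw value.
    var first: Aminoacid { Aminoacid(rawValue: byte >> 4)! }
    var second: Aminoacid { Aminoacid(rawValue: byte & 0x0F)! }
}

let g = PackedGenotype(.A, .G)   // one byte, 0x13 with the raw values above (A = 1, G = 3)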
If there is anyone out there who wants to collaborate on this project: so far it's just a hobby project, but I have plans to make a business out of it eventually. This means I can't employ anyone to do this, but if you're working on similar projects then let me know. We can make better things together. Just be aware that I'm writing in Swift and I'm going to deploy on macOS, but Swift is also available for other platforms, so coders for Linux and Windows are equally welcome to work on a joint project.
This became a little off-topic. My question was about storing raw DNA and shared segments in a way that is optimal for fast search and comparison of huge amounts of DNA. I probably won't use CoreData for storage, since I would like to keep the option of porting to platforms other than Apple's. At the moment I'm using CoreData to experiment a little with storing DNA in different ways.

Similarity matching algorithm

I have products with different details in different attributes, and I need to develop an algorithm to find the products most similar to the one I'm searching for.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weights, etc. Then I need to do a search where the most similar ones are returned first. For example, if everything matches but the color is only Black and White (not Brown), that's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach I could take, for example, is to restrict each attribute (weight, color, size) to a limited set of options, so I can build a binary representation. Then I have something like this for each product:
Colors       Weight    Height    Condition
00011011000  10110110  10001100  01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: also if I could use different weights for each attribute, it would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest items by keeping sorted lists of certain fields such as Height and Weight, capping the distance at a threshold (as in broad-phase collision detection), and then limiting full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for Elasticsearch and other similar search-oriented databases. I don't think you need to hack around with bitmasks etc.
You would typically maintain your primary data in your existing database (SQL/Cassandra/Mongo/etc.; anything works), and copy the things that need searching to Elasticsearch.
What you are talking about is very similar to BK-trees. A BK-tree is a search tree with a metric associated with its keys. The most common use of this tree is string correction with Levenshtein or Damerau-Levenshtein distances. It is not a static data structure, so it supports future insertions of elements.
When you search for an exact element (or insert an element), you walk down the tree, at each node following the link whose weight equals the distance between that node's key and your element. If you want to find similar objects, you descend into several children simultaneously, namely those whose edge weights satisfy your distance constraints. (It might even be possible to use A* to quickly find the single most similar object.)
A simple example of a BK-tree (from the second link below):
          BOOK
         /    \
     (1)/      \(4)
       /        \
    BOOKS      CAKE
     /         /   \
 (2)/      (1)/     \(2)
   /          |      |
  BOO        CAPE   CART
Your metric should be Hamming distance (the number of differing bits between the bit representations of two objects).
But is it a good idea to compare two integers by the number of differing bits in their representations? With Hamming distance, HD(10000, 00000) == HD(10000, 10001), i.e. the difference between 16 and 0 is considered equal to the difference between 16 and 17. Is that really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
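For concreteness, here is a minimal BK-tree sketch over the bitmask representation, using Hamming distance as the metric (an illustrative sketch, not a tuned implementation):
// Minimal BK-tree keyed by Hamming distance between 64-bit bitmasks.
final class BKNode {
    let value: UInt64
    var children: [Int: BKNode] = [:]   // edge weight (distance) -> subtree
    init(_ value: UInt64) { self.value = value }
}

func hamming(_ a: UInt64, _ b: UInt64) -> Int { (a ^ b).nonzeroBitCount }

final class BKTree {
    private var root: BKNode?

    func insert(_ value: UInt64) {
        guard var node = root else { root = BKNode(value); return }
        while true {
            let d = hamming(value, node.value)
            if d == 0 { return }                    // already present
            if let child = node.children[d] {
                node = child
            } else {
                node.children[d] = BKNode(value)
                return
            }
        }
    }

    // All stored values within `tolerance` differing bits of `query`.
    func search(_ query: UInt64, tolerance: Int) -> [UInt64] {
        var results: [UInt64] = []
        var stack: [BKNode] = root.map { [$0] } ?? []
        while let node = stack.popLast() {
            let d = hamming(query, node.value)
            if d <= tolerance { results.append(node.value) }
            // Triangle inequality: only subtrees whose edge weight is within
            // `tolerance` of d can contain a match.
            for (edge, child) in node.children where abs(edge - d) <= tolerance {
                stack.append(child)
            }
        }
        return results
    }
}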

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like this:
384:w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND:wemfOGxqCfOTPi0ND
They can be broken down into three sections (split the string on the colons):
Block size: 384 in the hash above
First signature: w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND
Second signature: wemfOGxqCfOTPi0ND
Block size refers to the block size for the first signature, and the block size for the second signature is twice that of the first signature (here: 384 x 2 = 768). Each file has one of these composite hashes, which means each file has two signatures with different block sizes.
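A tiny sketch of that decomposition (illustrative; it just splits on the first two colons and doubles the block size for the second signature):
// Split "384:firstSignature:secondSignature" into its parts; the block size of
// the second signature is always twice that of the first.
func parseSpamsum(_ hash: String) -> (blk1: Int, sig1: String, blk2: Int, sig2: String)? {
    let parts = hash.split(separator: ":", maxSplits: 2, omittingEmptySubsequences: false)
    guard parts.count == 3, let blk1 = Int(parts[0]) else { return nil }
    return (blk1, String(parts[1]), blk1 * 2, String(parts[2]))
}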
The spamsum signatures can be compared only if their block sizes correspond. That is to say, the composite hash above can be compared to any other composite hash that contains a signature with a block size of 384 or 768. The similarity of the signature strings for hashes with corresponding block sizes can be taken as a measure of similarity between the files represented by the hashes.
So if we have:
file1.blk2 = 768
file1.sig2 = wemfOGxqCfOTPi0ND
file2.blk1 = 768
file2.sig1 = LsmfOGxqCfOTPi0ND
We can get a sense of the degree of similarity of the two files by calculating some weighted edit distance (like Levenshtein distance) for the two signatures. Here the two files seem to be pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
One can also calculate a normalized similarity score between two hashes (see the details here).
I would like to find any two files that are more than 70% similar based on these hashes, and I have a strong preference for using the available software packages (or APIs/SDKs), although I am not afraid of coding my way through the problem.
I have tried breaking the hashes down and indexing them using Lucene (4.7.0), but the search seems to be slow and tedious. Here is an example of the Lucene queries I have tried (for each single signature -- twice per hash and using the case-sensitive KeywordAnalyzer):
(blk1:768 AND sig1:wemfOGxqCfOTPi0ND~0.7) OR (blk2:768 AND sig2:wemfOGxqCfOTPi0ND~0.7)
It seems that Lucene's incredibly fast Levenshtein automata do not accept edit distance limits above 2 (I need it to support up to 0.7 x 64 ≃ 19), and that its normal edit distance algorithm is not optimized for long search terms (the brute-force method used does not cut off the calculation once the distance limit is reached). That said, it may be that my query is not optimized for what I want to do, so don't hesitate to correct me on that.
I am wondering whether I can accomplish what I need using any of the algorithms offered by Lucene, instead of directly calculating the editing distance. I have heard that BK-trees are the best way to index for such searches, but I don't know of the available implementations of the algorithm (Does Lucene use those at all?). I have also heard that a probable solution is to narrow down the search list using n-gram methods but I am not sure how that compares to editing distance calculation in terms of inclusiveness and speed (I am pretty sure Lucene supports that one). And by the way, is there a way to have Lucene run a term search in the parallel mode?
Given that I am using Lucene only to pre-match the hashes and that I calculate the real similarity score using the appropriate algorithm later, I just need a method that is at least as inclusive as Levenshtein distance used in similarity score calculation -- that is, I don't want the pre-matching method to exclude hashes that would be flagged as matches by the scoring algorithm.
Any help/theory/reference/code or clue to start with is appreciated.
This is not a definitive answer to the question, but I have tried a number of methods ever since. I am assuming the hashes are saved in a database, but the suggestions remain valid for in-memory data structures as well.
Save all signatures (2 per hash) along with their corresponding block sizes in a separate child table. Since only signatures of the same size can be compared with each other, you can filter the table by block size before starting to compare the signatures.
Reduce all the repetitive sequences of more than three characters to three characters ('bbbbb' -> 'bbb'). Spamsum's comparison algorithm does this automatically.
Spamsum uses a rolling window of 7 to compare signatures, and won't compare any two signatures that do not have a 7-character overlap after eliminating excessive repetitions. If you are using a database that supports lists/arrays as fields, create a field with a list of all possible 7-character sequences extracted from each signature. Then create the fastest exact-match index you have access to on this field. Before trying to compute the distance between two signatures, first do exact matches over this field (any seven-gram in common? see the sketch at the end of this answer).
The last step I am experimenting with is to save the signatures and their seven-grams as the two modes of a bipartite graph, project the graph into a single mode (composed of hashes only), and then calculate the Levenshtein distance only on adjacent nodes with corresponding block sizes.
The above steps do a good pre-matching job and substantially reduce the number of signatures each signature has to be compared with. Only after these does the modified Levenshtein/Damerau distance have to be calculated.
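As a sketch of the repetition-collapsing and seven-gram extraction described above (illustrative only; spamsum itself collapses the repetitions as part of its comparison):
// Collapse runs longer than three characters ("bbbbb" -> "bbb"), as spamsum's
// comparison does, then extract every 7-character window for exact-match indexing.
func collapseRuns(_ signature: String, maxRun: Int = 3) -> String {
    var out = ""
    var runChar: Character? = nil
    var runLength = 0
    for ch in signature {
        if ch == runChar {
            runLength += 1
        } else {
            runChar = ch
            runLength = 1
        }
        if runLength <= maxRun { out.append(ch) }
    }
    return out
}

func sevenGrams(_ signature: String, size: Int = 7) -> Set<String> {
    let chars = Array(collapseRuns(signature))
    guard chars.count >= size else { return [] }
    var grams = Set<String>()
    for i in 0...(chars.count - size) {
        grams.insert(String(chars[i..<(i + size)]))
    }
    return grams
}

// Two signatures are worth a full edit-distance comparison only if their
// seven-gram sets intersect: !sevenGrams(a).isDisjoint(with: sevenGrams(b))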