Search for (Very) Approximate Substrings in a Large Database - indexing

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.

Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matching, etc.
The SimMetrics library is a comprehensive collection of string distance approaches.

Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing (LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.
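To make the q-gram idea concrete, here is a minimal sketch of that prune-then-score strategy in Python: build an inverted index from q-grams to the database strings that contain them, use a cheap overlap count to select a small candidate set, and only run an expensive similarity check on the survivors. The q value, the candidate threshold and the use of difflib as a stand-in for a proper (banded) edit distance are illustrative assumptions, not a tuned implementation.

# Sketch: q-gram candidate pruning followed by exact scoring.
from collections import defaultdict
import difflib

Q = 3  # q-gram length (illustrative choice)

def qgrams(s, q=Q):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(database):
    """Map each q-gram to the ids of the database strings containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(database):
        for gram in qgrams(text):
            index[gram].add(doc_id)
    return index

def search(query, database, index, min_shared=5):
    # Pruning step: count shared q-grams per database string.
    counts = defaultdict(int)
    for gram in qgrams(query):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    candidates = [d for d, c in counts.items() if c >= min_shared]
    # Scoring step: run the expensive similarity measure on the survivors only.
    scored = [(difflib.SequenceMatcher(None, query, database[d]).ratio(), d)
              for d in candidates]
    return sorted(scored, reverse=True)

database = ["the quick brown fox jumps over the lazy dog",
            "a completely unrelated string of characters",
            "the quick brown fax jumped over a lazy dog"]
index = build_index(database)
print(search("quick brown fox jumping over lazy dogs", database, index))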

Related

Apriori algorithm expert is needed

I have a dataset with 3.3M rows and 8k unique products.
I want to apply the Apriori algorithm to find association rules and connections between products.
Well, I have done this before on a much smaller database with 50k rows and maybe 200 unique products.
Does anyone know how I can do this effectively at a larger scale? Maybe there are tricks to reduce the size of the data while still getting useful results.
Any help would be amazing! Reach out to me if you have experience with this algorithm.
The trick is: Don't use Apriori.
Use LCM or the top-down version of FP-Growth.
You can find my implementations here:
command line programs: https://borgelt.net/fim.html (eclat with option -o gives LCM)
Python: https://borgelt.net/pyfim.html
R: https://borgelt.net/fim4r.html
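For the Python module, usage looks roughly like the sketch below. The keyword arguments shown (target='r' for association rules, supp and conf as percentages, zmin, and the report codes) follow the pyfim documentation; check them against the version you actually install, as defaults may differ.

# Minimal sketch of rule mining with pyfim (https://borgelt.net/pyfim.html).
# Parameter names below are assumptions based on the pyfim docs; verify them
# against your installed version.
from fim import fpgrowth  # or: from fim import eclat

# Each transaction is simply a list of the products bought together.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "milk", "cereal"],
]

# target='r': mine association rules; supp/conf: minimum support and
# confidence in percent; zmin=2: at least two items per rule;
# report='sc': report support and confidence with each rule.
rules = fpgrowth(transactions, target='r', supp=25, conf=60, zmin=2, report='sc')
for head, body, support, confidence in rules:
    print(list(body), "->", head, " support:", support, " confidence:", confidence)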

How to use percentage (floating) similarity fuzzy queries in Lucene?

Lucene, version: 7.3.0.
All I want is to use percentage (floating) similarity fuzzy queries (FuzzyQuery class) in Lucene.
defaultMinSimilarity is now deprecated, so I can use only defaultMaxEdits for my purposes.
As far as I can see, maximal supported distance for org.apache.lucene.search.FuzzyQuery can't be more than 2:
MAXIMUM_SUPPORTED_DISTANCE = 2
What if I want to search for strings that are 55% similar, but for a long term?
How can I do that with Lucene's FuzzyQuery?
Can I bypass that maximum-2-step edit distance restriction at all?
Can you bypass that FuzzyQuery limitation? No. Can you do it at all? Almost certainly yes, but you need to rethink the problem a bit. FuzzyQuery is not the answer.
You should instead consider how you could use analysis to solve your problem. Indexing n-grams would be the most direct solution for very loose, fuzzy-style matching; see NGramTokenFilter.
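To see why n-grams give you that kind of loose, percentage-style matching, here is the idea as a plain Python sketch rather than through Lucene's API: terms are broken into character trigrams and scored by their trigram overlap, so a query several edits away can still clear a "55% similar" threshold. The trigram size and the Jaccard-style score are illustrative choices, not what Lucene computes internally.

# Conceptual sketch (plain Python, not Lucene): index character n-grams and
# score by overlap to get loose, percentage-style matching instead of a hard
# 2-edit limit. Trigram size and the Jaccard score are illustrative choices.
def char_ngrams(term, n=3):
    term = term.lower()
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)  # Jaccard overlap of trigram sets

indexed_terms = ["internationalization", "internationalisation", "nationalism"]
query = "internationalizashun"  # three edits from the first term -- beyond FuzzyQuery's limit of 2
for term in indexed_terms:
    score = ngram_similarity(query, term)
    if score >= 0.55:               # "55% similar"-style threshold
        print(term, round(score, 2))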

Machine Learning text comparison model

I am creating a machine learning model that essentially returns a measure of how correct one text is relative to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words that occur in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less emphasis, which effectively gives more weight to the document-specific phrasing (this is the IDF part).
Ordering does not matter in the Term Frequency (TF) part, as opposed to methods that use sliding windows etc.
It is very lightweight compared to representation-oriented methods like the one mentioned above.
Big drawback: depending on the size of the corpus, your data may have too many dimensions (one dimension per unique word); you could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
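As a concrete sketch of the TF-IDF plus cosine similarity idea (scikit-learn is used here purely as one convenient implementation; any TF-IDF library, or a hand-rolled version, works the same way):

# Sketch: TF-IDF vectors + cosine similarity give order-insensitive text
# comparison where frequent filler words carry little weight.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat and a dog",
    "a dog and the cat",
    "a completely different sentence about the weather",
]

# Words that appear in many documents ("a", "and", "the") get a low IDF
# weight automatically; stop_words="english" would drop them entirely.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Similarity of the first sentence to the other two; the reordered sentence
# scores 1.0 (same bag of words), the unrelated one scores much lower.
print(cosine_similarity(tfidf[0], tfidf[1:]))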
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Oracle String Conversion - Alpha String to Numeric Score, Fuzzy Match

I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream as "Snug". My initial thought was to convert each character to a number so that the sums would be the same; that way, even if two characters are transposed, I'd be able to bucket these records together appropriately.
The other case is where one stream has "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these so that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to build these classification buckets so that I can reduce the noise in my primary task of finding records that belong to truly different people as opposed to being clerical errors.
Any thoughts would be very appreciated.
A common measure for this kind of distance is the Levenshtein distance (see Wikipedia). It measures the "edit" distance between two strings -- the number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH package.
The bad news is that it is really, really expensive on millions of data points, and I cannot help you much there. One idea is to only compare names that are already "close enough", for example because they share a certain minimum number of characters.
Another method is to convert the strings to what they sound like; that is called Soundex. You may be able to use the two together -- assuming your names are predominantly English (Soundex is best known from its use by the US Census Bureau, so it works best on American names).
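The overall strategy can be sketched language-agnostically (shown here in Python rather than PL/SQL): compute a cheap blocking key first, and only run the expensive edit distance within each block, so you never compare all pairs across millions of rows. The first-letter blocking key and the threshold of 2 edits below are purely illustrative stand-ins for SOUNDEX and UTL_MATCH.EDIT_DISTANCE.

# Sketch: block names by a cheap key, then run edit distance only within blocks.
# The blocking key (first letter) is a crude stand-in for SOUNDEX, which also
# keeps the first letter and encodes the rest phonetically.
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

names = ["Sung", "Snug", "Lilly", "Lillly", "Smith"]

blocks = defaultdict(list)
for name in names:
    blocks[name[0].lower()].append(name)    # blocking key: first letter

for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            d = levenshtein(block[i].lower(), block[j].lower())
            if d <= 2:                      # "close enough" threshold (example)
                print(block[i], "~", block[j], "edit distance", d)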

Optimal combination of files to the blocks of 4.8GB

My drive holds a set of DMG files whose sizes sum to strictly less than 47GB. I have 11 DVDs, each 4.7GB in size. I want to use as few DVDs as possible, without using compression (that remark may be superfluous, since the problem is about finding the most optimal combinations of the DMG files; you can think of them as already-compressed files if you want).
As you can see from the listing below, the DMG files have arbitrary sizes, so many combinations are possible.
find . -iname "*.dmg" -exec du '{}' \; 2> /dev/null
1026064 ./Desktop/Desktop2.dmg
5078336 ./Desktop/Desktop_2/CS_pdfs.dmg
2097456 ./Desktop/Desktop_2/Signal.dmg
205104 ./Dev/things.dmg
205040 ./Dev/work.dmg
1026064 ./DISKS/fun.dmg
1026064 ./DISKS/school.dmg
1026064 ./DISKS/misc.dmg
5078336 ./something.dmg
The files can be placed on the DVDs in any order; for example, CS_pdfs.dmg and Signal.dmg do not need to be on the same disk.
So how can I find an arrangement that uses as few DVDs as possible?
Mathematically, your problem is the bin packing problem (which is related to the knapsack problem).
Since it is NP-hard, it is very difficult to solve efficiently! There is an exact recursive solution (dynamic programming + backtracking), but even that may require large amounts of space and computation time.
The most straightforward solution is a greedy algorithm (see Blindy's post), but this may give bad results.
It depends on how many items (n) you want to pack and how precise the solution must be (more precision will increase the runtime!). For small n the recursive/brute-force or backtracking solution is sufficient; for bigger problems I'd advise using some metaheuristic - genetic algorithms in particular work quite well and yield good approximations in acceptable timespans.
Totally different solution: use split and cut the files across the DVD boundaries. You'll get 100% utilization of every disc but the last. http://unixhelp.ed.ac.uk/CGI/man-cgi?split
You should probably try the greedy algorithm before anything else - that is, each time pick the largest item that still fits on the current DVD. While this is not guaranteed to work well, the problem is NP-complete, so no efficient exact algorithm is known. I had a similar problem recently, and the greedy algorithm worked quite well in my case - maybe it'll be good enough in yours as well.
The most generic solution would involve implementing a simple backtracking algorithm, but I'm fairly certain that in this particular case you can just sort the files by size and repeatedly pick the largest one that still fits on the current disc until it's full, then move on to the next disc with the remaining files.
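As a sketch of that sort-then-greedy idea (first-fit decreasing): sort the files largest first, then drop each one into the first DVD that still has room, opening a new DVD only when nothing fits. The sizes are the du figures from the question, and the capacity constant assumes du reported 512-byte blocks (the macOS default); adjust it if your du uses 1K blocks.

# Sketch: first-fit decreasing (FFD) bin packing of the DMG files onto DVDs.
# DVD capacity is expressed in the same units as the du output; ~4.7 GB is
# roughly 9_180_000 blocks of 512 bytes (adjust for your du block size).
DVD_CAPACITY = 9_180_000

files = {
    "Desktop2.dmg": 1026064,
    "CS_pdfs.dmg": 5078336,
    "Signal.dmg": 2097456,
    "things.dmg": 205104,
    "work.dmg": 205040,
    "fun.dmg": 1026064,
    "school.dmg": 1026064,
    "misc.dmg": 1026064,
    "something.dmg": 5078336,
}

def first_fit_decreasing(files, capacity):
    dvds = []  # each entry: [free_space, [file names]]
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity:
            print(name, "does not fit on a single DVD; it would need split")
            continue
        for dvd in dvds:                 # first DVD with enough free space
            if dvd[0] >= size:
                dvd[0] -= size
                dvd[1].append(name)
                break
        else:                            # no existing DVD fits: open a new one
            dvds.append([capacity - size, [name]])
    return dvds

for i, (free, names) in enumerate(first_fit_decreasing(files, DVD_CAPACITY), 1):
    print("DVD", i, ":", names, "(free blocks:", free, ")")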