Lucene ignores / overwrites fuzzy edit distance in QueryParser

Given the following QueryParser with a FuzzySearch term in the query string:
fun fuzzyquery() {
    val query = QueryParser("term", GermanAnalyzer()).parse("field:search~4")
    println(query)
}
The resulting Query will actually have this representation:
field:search~2
So, the ~4 gets rewritten to ~2. I traced the code down to the following implementation:
QueryParserBase
protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    String text = term.text();
    int numEdits = FuzzyQuery.floatToEdits(minimumSimilarity, text.codePointCount(0, text.length()));
    return new FuzzyQuery(term, numEdits, prefixLength);
}
FuzzyQuery
public static int floatToEdits(float minimumSimilarity, int termLen) {
    if (minimumSimilarity >= 1.0F) {
        return (int) Math.min(minimumSimilarity, 2.0F);
    } else {
        return minimumSimilarity == 0.0F
                ? 0
                : Math.min((int) ((1.0D - (double) minimumSimilarity) * (double) termLen), 2);
    }
}
As is clearly visible, any value higher than 2 will always get reset to 2. Why is this and how can I correctly get the fuzzy edit distance I want into the query parser?

This may cross the border into "not an answer" - but it is too long for a comment (or a few comments):
Why is this?
That was a design decision, it would seem. It's mentioned in the documentation here.
"The value is between 0 and 2"
There is an old article here which gives an explanation:
"Larger differences are far more expensive to compute efficiently and are not processed by Lucene.".
I don't know how official that is, however.
More officially, from the JavaDoc for the FuzzyQuery class, it states:
"At most, this query will match terms up to 2 edits. Higher distances (especially with transpositions enabled), are generally not useful and will match a significant amount of the term dictionary."
How can I correctly get the fuzzy edit distance I want into the query parser?
You cannot, unless you customize the source code.
The best (least worst?) alternative, I think, is probably the one mentioned in the above referenced FuzzyQuery Javadoc:
"If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead."
In this case, one price to be paid will be a potentially much larger index - and even then, n-grams are not really equivalent to edit distances. I don't know if this would meet your needs.
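For completeness, a minimal sketch of that SpellChecker route might look like the following. This is my own illustration, not code from the Lucene docs: the field name "field", the GermanAnalyzer and mainIndexDirectory are placeholders for whatever your index actually uses, and depending on your Lucene version the IndexWriterConfig and analyzer constructors may also need a Version argument.

import java.io.IOException;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

String[] suggest(Directory mainIndexDirectory) throws IOException {
    // Build a dedicated n-gram spell index from the terms of an existing field
    SpellChecker spellChecker = new SpellChecker(new RAMDirectory());
    String[] suggestions;
    try (DirectoryReader reader = DirectoryReader.open(mainIndexDirectory)) {
        spellChecker.indexDictionary(
                new LuceneDictionary(reader, "field"),
                new IndexWriterConfig(new GermanAnalyzer()),
                true);
        // Suggestions are ranked by string distance rather than capped at 2 edits
        suggestions = spellChecker.suggestSimilar("search", 10);
    }
    spellChecker.close();
    return suggestions;
}

Note that this returns suggested terms, not documents, so it replaces the fuzzy clause rather than the whole search.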

Binary search start or end is target

Why is it that when I see example code for binary search, there is never an if statement to check whether the start or end of the array is the target?
import java.util.Arrays;

public class App {
    public static int binary_search(int[] arr, int left, int right, int target) {
        if (left > right) {
            return -1;
        }
        int mid = (left + right) / 2;
        if (target == arr[mid]) {
            return mid;
        }
        if (target < arr[mid]) {
            return binary_search(arr, left, mid - 1, target);
        }
        return binary_search(arr, mid + 1, right, target);
    }

    public static void main(String[] args) {
        int[] arr = { 3, 2, 4, -1, 0, 1, 10, 20, 9, 7 };
        Arrays.sort(arr);
        for (int i = 0; i < arr.length; i++) {
            System.out.println("Index: " + i + " value: " + arr[i]);
        }
        System.out.println(binary_search(arr, 0, arr.length - 1, -1));
    }
}
In this example, if the target was -1 or 20 the search would enter recursion. But it added an if statement to check if the target is mid, so why not add two more statements also checking if it's left or right?
EDIT:
As pointed out in the comments, I may have misinterpreted the initial question. The answer below assumes that OP meant having the start/end checks as part of each step of the recursion, as opposed to checking once before the recursion even starts.
Since I don't know for sure which interpretation was intended, I'm leaving this post here for now.
Original post:
You seem to be under the impression that "they added an extra check for mid, so surely they should also add an extra check for start and end".
The check "Is mid the target?" is in fact not a mere optimization they added. Recursively checking "mid" is the whole point of a binary search.
When you have a sorted array of elements, a binary search works like this:
Compare the middle element to the target
If the middle element is smaller, throw away the first half
If the middle element is larger, throw away the second half
Otherwise, we found it!
Repeat until we either find the target or there are no more elements.
The act of checking the middle is fundamental to determining which half of the array to continue searching through.
Now, let's say we also add a check for start and end. What does this gain us? Well, if at any point the target happens to be at the very start or end of a segment, we skip a few steps and end slightly sooner. Is this a likely event?
For small toy examples with a few elements, yeah, maybe.
For a massive real-world dataset with billions of entries? Hm, let's think about it. For the sake of simplicity, we assume that we know the target is in the array.
We start with the whole array. Is the first element the target? The odds of that is one in a billion. Pretty unlikely. Is the last element the target? The odds of that is also one in a billion. Pretty unlikely too. You've wasted two extra comparisons to speed up an extremely unlikely case.
We limit ourselves to, say, the first half. We do the same thing again. Is the first element the target? Probably not since the odds are one in half a billion.
...and so on.
The bigger the dataset, the more useless the start/end "optimization" becomes. In fact, in terms of (maximally optimized) comparisons, each step of the algorithm has three comparisons instead of the usual one. VERY roughly estimated, that suggests that the algorithm on average becomes three times slower.
Even for smaller datasets, it is of dubious use since it basically becomes a quasi-linear search instead of a binary search. Yes, the odds are higher, but on average, we can expect a larger amount of comparisons before we reach our target.
The whole point of a binary search is to reach the target with as few wasted comparisons as possible. Adding more unlikely-to-succeed comparisons is typically not the way to improve that.
Edit:
The implementation as posted by OP may also confuse the issue slightly: it chooses to make two comparisons between target and mid. A more efficient implementation would instead make a single three-way comparison (i.e. determine ">", "=" or "<" as a single step instead of two separate ones). This is, for instance, how Java's compareTo or C++'s <=> normally works.
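As a rough illustration of that idea (my own sketch, not OP's code), an iterative binary search built around one three-way comparison per step could look like this in Java, with Integer.compare standing in for a compareTo-style comparison:

// One element comparison per step; its sign decides which half to keep.
public static int binarySearchThreeWay(int[] arr, int target) {
    int left = 0, right = arr.length - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;   // avoids overflow of (left + right)
        int cmp = Integer.compare(target, arr[mid]);
        if (cmp == 0) {
            return mid;
        } else if (cmp < 0) {
            right = mid - 1;
        } else {
            left = mid + 1;
        }
    }
    return -1;
}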
BambooleanLogic's answer is correct and comprehensive. I was curious about how much slower this 'optimization' made binary search, so I wrote a short script to test the change in how many comparisons are performed on average:
Given an array of integers 0, ... , N
do a binary search for every integer in the array,
and count the total number of array accesses made.
To be fair to the optimization, I made it so that after checking arr[left] against target, we increase left by 1, and similarly for right, so that every comparison is as useful as possible. You can try this yourself at Try it online
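The original script isn't reproduced here, but a rough Java reconstruction of the counting idea might look like the sketch below. The exact counting rules are my assumption, so its totals won't necessarily match the figures that follow.

// Counts array reads for a standard binary search versus the "check the
// boundaries too" variant, searching for every value 0..N in a sorted array.
public class CompareCounts {
    static long accesses = 0;

    static int get(int[] arr, int i) {
        accesses++;                 // every array read is counted
        return arr[i];
    }

    static int standard(int[] arr, int target) {
        int left = 0, right = arr.length - 1;
        while (left <= right) {
            int mid = left + (right - left) / 2;
            int v = get(arr, mid);
            if (v == target) return mid;
            if (target < v) right = mid - 1; else left = mid + 1;
        }
        return -1;
    }

    static int optimized(int[] arr, int target) {
        int left = 0, right = arr.length - 1;
        while (left <= right) {
            if (get(arr, left) == target) return left;
            left++;                 // the boundary we just checked is ruled out
            if (left > right) break;
            if (get(arr, right) == target) return right;
            right--;
            if (left > right) break;
            int mid = left + (right - left) / 2;
            int v = get(arr, mid);
            if (v == target) return mid;
            if (target < v) right = mid - 1; else left = mid + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] arr = new int[n + 1];
        for (int i = 0; i <= n; i++) arr[i] = i;

        accesses = 0;
        for (int t = 0; t <= n; t++) standard(arr, t);
        long std = accesses;

        accesses = 0;
        for (int t = 0; t <= n; t++) optimized(arr, t);
        long opt = accesses;

        System.out.println("Standard " + std + " Optimized " + opt
                + " Ratio " + (double) opt / std);
    }
}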
Results:
Binary search on size 10: Standard 29 Optimized 43 Ratio 1.4828
Binary search on size 100: Standard 580 Optimized 1180 Ratio 2.0345
Binary search on size 1000: Standard 8987 Optimized 21247 Ratio 2.3642
Binary search on size 10000: Standard 123631 Optimized 311205 Ratio 2.5172
Binary search on size 100000: Standard 1568946 Optimized 4108630 Ratio 2.6187
Binary search on size 1000000: Standard 18951445 Optimized 51068017 Ratio 2.6947
Binary search on size 10000000: Standard 223222809 Optimized 610154319 Ratio 2.7334
so the total number of comparisons does seem to tend toward triple the standard count, implying the optimization becomes increasingly unhelpful for larger arrays. I'd be curious whether the limiting ratio is exactly 3.
Adding extra checks for the start and end values alongside the mid check does not gain much.
In algorithm design the main concern is complexity, whether time complexity or space complexity, and most of the time the time complexity is treated as the more important aspect.
It is also worth learning about the binary search algorithm in its different use cases, for example:
when the array contains no repeated elements;
when the array does contain repeated elements, in which case you may want to
a) return the leftmost index/value, or
b) return the rightmost index/value;
and many more. (A rough sketch of the leftmost/rightmost variants follows below.)
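A minimal sketch of those leftmost/rightmost variants (my own illustration, not part of the answer above), using a half-open [left, right) interval:

// Index of the first element equal to target, or -1 if absent.
static int leftmost(int[] arr, int target) {
    int left = 0, right = arr.length;
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (arr[mid] < target) left = mid + 1; else right = mid;
    }
    return (left < arr.length && arr[left] == target) ? left : -1;
}

// Index of the last element equal to target, or -1 if absent.
static int rightmost(int[] arr, int target) {
    int left = 0, right = arr.length;
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (arr[mid] <= target) left = mid + 1; else right = mid;
    }
    return (left > 0 && arr[left - 1] == target) ? left - 1 : -1;
}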

Getting Term Frequencies For Query

In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
Via looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null)
{
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}

IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);

Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);

float freq = 0;
if (thisDoc == deBasedDoc)
{
    freq = scorer.freq();
}
The 'weightCache' is of type Map and is useful so that you don't have to re-create the Weight object for every document you process. (otherwise, the code runs about 10x slower)
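For reference, given the way it is used above, the cache could be declared as something like this (the generic types are my inference from the get/put calls):

Map<Query, Weight> weightCache = new HashMap<>();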
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.
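A rough sketch of that traversal (my own code, and it only handles BooleanQuery and TermQuery; other wrappers would need their own branches):

// Recursively collect all leaf TermQuery objects inside a query tree.
static void collectTermQueries(Query query, List<TermQuery> out) {
    if (query instanceof TermQuery) {
        out.add((TermQuery) query);
    } else if (query instanceof BooleanQuery) {
        for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
            collectTermQueries(clause.getQuery(), out);
        }
    }
    // other query types (e.g. BoostQuery, DisjunctionMaxQuery) are ignored here
}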

How can I ask Lucene to do simple, flat scoring?

Let me preface by saying that I'm not using Lucene in a very common way and explain how my question makes sense. I'm using Lucene to do searches in structured records. That is, each document, that is indexed, is a set of fields with short values from a given set. Each field is analysed and stored, the analysis producing usually no more than 3 and in most cases just 1 normalised token. As an example, imagine files for each of which we store two fields: the path to the file and a user rating in 1-5. The path is tokenized with a PathHierarchyTokenizer and the rating is just stored as-is. So, if we have a document like
path: "/a/b/file.txt"
rating: 3
This document will have for its path field the tokens "/a", "/a/b" and "/a/b/file.txt", and for rating the token "3".
I wish to score this document against a query like "path:/a path:/a/b path:/a/b/different.txt rating:1" and get a value of 2 - the number of terms that match.
My understanding and observation is that the score of a document depends on various term metrics, and with many documents each having many fields, I am most definitely not getting simple integer scores.
Is there some way to make Lucene score documents in the outlined fashion? The queries that are run against the index are not generated by the users, but are built by the system and have an optional filter attached, meaning they all have a fixed form of several TermQuerys joined in a BooleanQuery with nothing like any fuzzy textual searches. Currently I don't have the option of replacing Lucene with something else, but suggestions are welcome for a future development.
I doubt there's something ready to use, so most probably you will need to implement your own scorer and use it when searching. For complicated cases you may want to play around with queries, but for a simple case like yours it should be enough to override DefaultSimilarity, setting the tf factor to the raw frequency (the number of the specified terms in the document in question) and all other components to 1. Something like this:
public class MySimilarity extends DefaultSimilarity {

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1;
    }

    @Override
    public float tf(float freq) {
        return freq;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }
}
(Note that tf() is the only method that returns something other than 1.)
And then just set the similarity on the IndexSearcher.
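For example (assuming an existing IndexSearcher instance named indexSearcher):

indexSearcher.setSimilarity(new MySimilarity());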

Lucene SpellChecker Prefer Permutations or special scoring

I'm using Lucene.NET 3.0.3
How can I modify the scoring of the SpellChecker (or queries in general) using a given function?
Specifically, I want the SpellChecker to score any results that are permutations of the searched word higher than the rest of the suggestions, but I don't know where this should be done.
I would also accept an answer explaining how to do this with a normal query. I have the function, but I don't know if it would be better to make it a query or a filter or something else.
I think the best way to go about this would be to use a customized Comparator in the SpellChecker object.
Check out the source code of the default comparator here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-spellchecker/3.6.0/org/apache/lucene/search/spell/SuggestWordScoreComparator.java?av=f
Pretty simple stuff, should be easy to extend if you already have the algorithm you want to use to compare the two Strings.
Then you can set it up to use your comparator with SpellChecker.SetComparator.
I think I mentioned the possibility of using a Filter for this in a previous question of yours, but I don't think that's really the right way to go, looking at it a bit more.
EDIT---
Indeed, no Comparator is available in 3.0.3, so I believe you'll need to access the scoring through a StringDistance object. The Comparator would be nicer, since the scoring has already been applied and is passed into it to do what you please with; extending a StringDistance is a bit less clean, since you will have to apply your rules as part of the score itself.
You'll probably want to extend LevensteinDistance (source code), which is the default StringDistance implementation, but of course, feel free to try JaroWinklerDistance as well; I'm not really that familiar with that algorithm.
Primarily, you'll want to override getDistance and apply your scoring rules there, after getting a baseline distance from the standard (parent) implementation's getDistance call.
I would probably implement something like this (assuming you have a helper method boolean isPermutation(String, String)):
class CustomDistance extends LevensteinDistance {
    @Override
    public float getDistance(String target, String other) {
        float distance = super.getDistance(target, other);
        if (isPermutation(target, other)) {
            distance = distance + (1 - distance) / 2;
        }
        return distance;
    }
}
To calculate a score half again closer to 1 for a result that is a permutation (that is, if the default algorithm gave distance = .6, this would return distance = .8, etc.). Distances returned must be between 0 and 1. My example is just one idea of a possible scoring for it, but you will likely need to tune your algorithm somewhat. I'd be cautious about simply returning 1.0 for all permutations, since that would be certain to prefer 'isews' over 'weis' when searching for 'weiss', and it would also lose the ability to rank the closeness of different permutations ('isews' and 'wiess' would be equal matches to 'weiss' in that case).
Once you have your Custom StringDistance it can be passed to SpellChecker either through the Constructor, or with SpellChecker.setStringDistance
From femtoRgon's advice, here's what I ended up doing:
public class PermutationDistance : SpellChecker.Net.Search.Spell.StringDistance
{
    public PermutationDistance()
    {
    }

    public float GetDistance(string target, string other)
    {
        LevenshteinDistance l = new LevenshteinDistance();
        float distance = l.GetDistance(target, other);
        distance = distance + ((1 - distance) * PermutationScore(target, other));
        return distance;
    }

    public bool IsPermutation(string a, string b)
    {
        char[] ac = a.ToLower().ToCharArray();
        char[] bc = b.ToLower().ToCharArray();
        Array.Sort(ac);
        Array.Sort(bc);
        a = new string(ac);
        b = new string(bc);
        return a == b;
    }

    public float PermutationScore(string a, string b)
    {
        char[] ac = a.ToLower().ToCharArray();
        char[] bc = b.ToLower().ToCharArray();
        Array.Sort(ac);
        Array.Sort(bc);
        a = new string(ac);
        b = new string(bc);
        LevenshteinDistance l = new LevenshteinDistance();
        return l.GetDistance(a, b);
    }
}
Then:
_spellChecker.setStringDistance(new PermutationDistance());
List<string> suggestions = _spellChecker.SuggestSimilar(word, 10).ToList();

Practice of checking 'trueness' or 'equality' in conditional statements - does it really make sense?

I remember many years back, when I was in school, one of my computer science teachers taught us that it was better to check for 'trueness' or 'equality' of a condition and not the negative stuff like 'inequality'.
Let me elaborate - If a piece of conditional code can be written by checking whether an expression is true or false, we should check the 'trueness'.
Example: Finding out whether a number is odd - it can be done in two ways:
if ( num % 2 != 0 )
{
    // Number is odd
}
or
if ( num % 2 == 1 )
{
    // Number is odd
}
(Please refer to the marked answer for a better example.)
When I was beginning to code, I knew that num % 2 == 0 implies the number is even, so I just put a ! there to check if it is odd. But he was like 'Don't check NOT conditions. Have the practice of checking the 'trueness' or 'equality' of conditions whenever possible.' And he recommended that I use the second piece of code.
I am not for or against either but I just wanted to know - what difference does it make? Please don't reply 'Technically the output will be the same' - we ALL know that. Is it a general programming practice or is it his own programming practice that he is preaching to others?
NOTE: I used C#/C++ style syntax for no reason. My question is equally applicable when using the IsNot, <> operators in VB etc. So readability of the '!' operator is just one of the issues. Not THE issue.
The problem occurs when, later in the project, more conditions are added - one of the projects I'm currently working on has steadily collected conditions over time (and then some of those conditions were moved into struts tags, then some to JSTL...) - one negative isn't hard to read, but 5+ is a nightmare, especially when someone decides to reorganize and negate the whole thing. Maybe on a new project, you'll write:
if (authorityLvl != Admin) {
    doA();
} else {
    doB();
}
Check back in a month, and it's become this:
if (!(authorityLvl != Admin && authorityLvl != Manager)) {
    doB();
} else {
    doA();
}
Still pretty simple, but it takes another second.
Now give it another 5 to 10 years to rot.
(x%2!=0) certainly isn't a problem, but perhaps the best way to avoid the above scenario is to teach students not to use negative conditions as a general rule, in the hopes that they'll use some judgement before they do - because just saying that it could become a maintenance problem probably won't be enough motivation.
As an addendum, a better way to write the code would be:
userHasAuthority = (authorityLvl == Admin);
if (userHasAuthority) {
    doB();
} else {
    doA();
}
Now future coders are more likely to just add "|| authorityLvl==Manager", userHasAuthority is easier to move into a method, and even if the conditional is reorganized, it will only have one negative. Moreover, no one will add a security hole to the application by making a mistake while applying De Morgan's Law.
I will disagree with your old professor - checking for a NOT condition is fine as long as you are checking for a specific NOT condition. It actually meets his criteria: you would be checking that it is TRUE that a value is NOT something.
I grok what he means though - mostly the true condition(s) will be orders of magnitude smaller in quantity than the NOT conditions, therefore easier to test for as you are checking a smaller set of values.
I've had people tell me that it's to do with how "visible" the ping (!) character is when skim reading.
If someone habitually "skim reads" code - perhaps because they feel their regular reading speed is too slow - then the ! can be easily missed, giving them a critical mis-understanding of the code.
On the other hand, if a someone actually reads all of the code all of the time, then there is no issue.
Two very good developers I've worked with (and respect highly) will each write == false instead of using ! for similar reasons.
The key factor in my mind is less to do with what works for you (or me!), and more with what works for the guy maintaining the code. If the code is never going to be seen or maintained by anyone else, follow your personal whim; if the code needs to be maintained by others, better to steer more towards the middle of the road. A minor (trivial!) compromise on your part now, might save someone else a week of debugging later on.
Update: On further consideration, I would suggest that factoring the condition out into a separate predicate function would give still greater maintainability:
if (isOdd(num))
{
    // Number is odd
}
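A possible definition (my own, for illustration):

static boolean isOdd(int num) {
    return num % 2 != 0;   // also correct for negative numbers
}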
You still have to be careful about things like this:
if ( num % 2 == 1 )
{
    // Number is odd
}
If num is negative and odd then, depending on the language or implementation, num % 2 could equal -1. On that note, there is nothing wrong with checking for the falseness if it simplifies at least the syntax of the check. Also, using != is clearer to me than just !-ing the whole thing, as the ! may blend in with the parentheses.
To only check the trueness you would have to do:
if ( num % 2 == 1 || num % 2 == -1 )
{
    // Number is odd
}
That is just an example obviously. The point is that if using a negation allows for fewer checks or makes the syntax of the checks clear then that is clearly the way to go (as with the above example). Locking yourself into checking for trueness does not suddenly make your conditional more readable.
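To make the language-dependence concrete: in Java, for example, the remainder takes the sign of the dividend, so (a quick illustration of my own):

int num = -3;
System.out.println(num % 2 == 1);   // false: -3 % 2 evaluates to -1 in Java
System.out.println(num % 2 != 0);   // true: correctly identifies -3 as odd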
I remember hearing the same thing in my classes as well. I think it's more important to always use the more intuitive comparison, rather than always checking for the positive condition.
Really a very inconsequential issue. However, one downside of checking in this sense is that it only works for binary comparisons. If you were, for example, checking some property of a ternary numerical system, you would be limited.
Replying to Bevan (it didn't fit in a comment):
You're right. !foo isn't always the same as foo == false. Let's see this example, in JavaScript:
var foo = true,
    bar = false,
    baz = null;

foo == false; // false
!foo; // false

bar == false; // true
!bar; // true

baz == false; // false (!)
!baz; // true
I also disagree with your teacher in this specific case. Maybe he was so attached to the generally good lesson to avoid negatives where a positive will do just fine, that he didn't see this tree for the forest.
Here's the problem. Today, you listen to him, and turn your code into:
// Print black stripe on odd numbers
void zebra(int num) {
    if (num % 2 == 1) {
        // Number is odd
        printf("*****\n");
    }
}
Next month, you look at it again and decide you don't like magic constants (maybe he teaches you this dislike too). So you change your code:
#define ZEBRA_PITCH 2

[snip pages and pages, these might even be in separate files - .h and .c]

// Print black stripe on non-multiples of ZEBRA_PITCH
void zebra(int num) {
    if (num % ZEBRA_PITCH == 1) {
        // Number is not a multiple of ZEBRA_PITCH
        printf("*****\n");
    }
}
and the world seems fine. Your output hasn't changed, and your regression testsuite passes.
But you're not done. You want to support mutant zebras, whose black stripes are thicker than their white stripes. You remember from months back that you originally coded it such that your code prints a black stripe wherever a white stripe shouldn't be - on the non-even numbers. So all you have to do is to divide by, say, 3, instead of by 2, and you should be done. Right? Well:
#define DEFAULT_ZEBRA_PITCH 2

[snip pages and pages, these might even be in separate files - .h and .c]

// Print black stripe on non-multiples of pitch
void zebra(int num, int pitch) {
    if (num % pitch == 1) {
        // Number is odd
        printf("*****\n");
    }
}
Hey, what's this? You now have mostly-white zebras where you expected them to be mostly black!
The problem here is how you think about numbers. Is a number "odd" because it isn't even, or because when dividing by 2, the remainder is 1? Sometimes your problem domain will suggest a preference for one, and in those cases I'd suggest you write your code to express that idiom, rather than fixating on simplistic rules such as "don't test for negations".