How can I perform a fuzzy search for all words provided in a Lucene.net search - lucene

I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So for example if I have a document with the string The big red fox, I am trying to get bag fix to match it.
The problem is, it seems like in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting this by
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first replace is due to Lucene.Net throwing an exception if the search string has a ~ already, apparently it can't handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to default all words to allow for fuzzyness?

You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how they do it. You will want to download the source and look at the source.

Related

Lucene part:123 vs part 123 vs part123

I'm not too familiar with Lucene so my apologies if this isn't clear or I'm getting my terms/nomenclature mixed up.
So I have a requirement where a field containing text (example part:123) should be able to be found via the following:
part:123
part 123
part123
Now my understanding of StandardAnalyzer is that it will break the word "part:123" into terms "part" and "123".
So with that, I'm able to search with part:123 or part 123, but because they're two different terms "part123" won't work.
It seems to me like I'd also need to get the indexer to add another term where both are combined, so it'd be "part", "123", "part123".
Is this the right way to accomplish this -- and does anyone know how I'd go about implementing this?
Thanks!

How to index only words with a minimum length using Apache Lucene 5.3.1?

may someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything which suits my needs except this, but I couldn't figure out how to use that.
Thanks!
Edit:
I guess that's important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way then one of those."
You should use a custom Analyzer with LengthTokeFilter. E.g.
Analyzer ana = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("standard")
.addTokenFilter("lowercase")
.addTokenFilter("length", "min", "4", "max", "50")
.addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
.build();
But it is better to use a stopword (words what occur in almost all documents, like articles for English language) list. This gives a more accurate result.

How exact phrase search is performed by a Search Engine?

I am using Lucene to search in a Data-set, I need to now how "" search (I mean exact phrase search) mechanism has been implemented?
I want to make it able to result all "little cat" hits when the user enters "littlecat". I now that I should manipulate the indexing code, but at least I should now how the "" search works.
I want to make it able to result all "little cat" hits when the user enters "littlecat"
This might sound easy but it is very tough to implement. For a human being little and cat are two different words but for a computer it does not know little and cat seperately from littlecat, unless you have a dictionary and your code check those two words in dictionary. On the other hand searching for "little cat" can easily search for "littlecat" aswell. And i believe that this goes beyong the concept of an exact phrase search. Exact phrase search will only return littlecat if you search for "littlecat" and vice versa. Even google seemingly (expectedly too), doesnt return "little cat" on littlecat search
A way to implement this is Dynamic programming - using a dictionary/corpus to compare your individual words against(and also the left over words after you have parsed the text into strings).
Think of it like you were writing a custom spell-checker or likewise. In this, there's also a scenario when more than one combination of words may be left over eg -"walkingmydoginrain" - here you could break the 1st word as "walk", or as "walking" , and this is the beauty of DP - since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (ie rest of the string - you have just discovered that in this context - you should pick the segmented word as "Walking" and NOT walk.
Also think of it like not being able to find a match is adding to a COST function that you define, so you should get optimal results - meaning you can be sure that your text(un-separated with white spaces) will for sure be broken into legitimate words- though there may be MORE than one possible word sequences in that line(and hence, possibly also intent of the person seeking this)
You should be able to find pretty good base implementations over the web for your use case (read also : How does Google implement - "Did you mean" )
For now, see also -
How to split text without spaces into list of words?

Apache lucene and text meaning

I have a question about searching process in lucene/.
I use this code for search
Directory directory = FSDirectory.GetDirectory(#"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(search string);
In one document I've set "I want to go shopping" for a field and in other document I've set "I wanna go shopping".
the meaning of both sentences is same!
is there any good solution for lucene to understand meaning of sentences or kind of normalize the scentences ? for example save the fields like "I wanna /want to/ go shopping" and remove the comment with regexp in result.
Lucene provides filter to normalize words and even map similar words.
PorterStemFilter -
Stemming allows words to be reduced to their roots.
e.g. wanted, wants would be reduced to root want and search for any of those words would match the document.
However, wanna does not reduce to root want. So it may not work in this case.
SynonymFilter -
would help you to map words similar in a configuration file.
so wanna can be mapped to want and if you search for either of those, the document must match.
you would need to add the filters in your analysis chain.

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be use to force a term to be included but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene. Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has a extensive query language as described here that describes everything you want except for + being the default but that's something you can simple handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise to adhere to the default Lucene syntax) and then you can write the transformations from your own syntax to the Lucene syntax.
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.