I am using Lucene to perform spell checking. I am using
https://lucene.apache.org/core/5_4_1/suggest/org/apache/lucene/search/spell/SpellChecker.html
What I really want is this: for example, I have the word
spellingmistake
And now I type:
speli.
In this case, I want the spell checker to return the correct word, or at least return spell. So, to achieve this, while indexing the dictionary I used an EdgeNGramTokenizer in the IndexWriterConfig, assuming it would work for this case. But unfortunately, it is not working.
How do I get this working?
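For reference, the kind of setup described here might look roughly like the following Lucene 5.4 sketch; the paths, gram sizes, and dictionary format are assumptions:

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EdgeNGramSpellIndex {
    public static void main(String[] args) throws Exception {
        // Analyzer that produces edge n-grams (3..20 characters) of its input,
        // so "spellingmistake" also yields "spe", "spel", "spell", ...
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new EdgeNGramTokenizer(3, 20);
                return new TokenStreamComponents(tokenizer);
            }
        };

        Directory spellDir = FSDirectory.open(Paths.get("spell-index"));    // path is an assumption
        SpellChecker spellChecker = new SpellChecker(spellDir);

        // Hand the EdgeNGram-backed analyzer to the IndexWriterConfig used for the spell index.
        spellChecker.indexDictionary(
                new PlainTextDictionary(Paths.get("dictionary.txt")),       // one word per line, assumed
                new IndexWriterConfig(analyzer),
                true);

        // Ask for suggestions for the misspelled prefix.
        String[] suggestions = spellChecker.suggestSimilar("speli", 5);
        for (String s : suggestions) {
            System.out.println(s);
        }
        spellChecker.close();
    }
}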
Thanks.
I'm processing some Indonesian texts in a Java application, and I need to stem them.
Currently I am using the Lucene Indonesian stemmer:
org.apache.lucene.analysis.id.IndonesianAnalyzer;
but the results are not satisfactory.
Could anyone suggest a different stemmer?
"enang" is a stem. Stems need not be actual words. For instance, in English, "argue" "argues" and "arguing" reduce to the stem "argu". "argu" isn't an english word, but it is a meaningful stem. This is how stemmers work. As long as you apply the stemmer the same way to the indexed data and the query, it should work well.
If you don't want behavior like that, it doesn't make any sense to use a stemmer at all.
Aside from the stemmer, IndonesianAnalyzer is fairly easy to replicate. Its other components are just a StandardTokenizer, StandardFilter, LowerCaseFilter, and a StopFilter. That's just a StandardAnalyzer with an Indonesian stopword set, when you get right down to it, so you can create an IndonesianAnalyzer without the stemmer as simply as:
// If you are using the default stopword location defined in IndonesianAnalyzer, you could load the stopwords like this:
CharArraySet defaultStopSet = StopwordAnalyzerBase.loadStopwordSet(false, IndonesianAnalyzer.class, IndonesianAnalyzer.DEFAULT_STOPWORD_FILE, "#");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, defaultStopSet);
I'm not sure whether you would run into problems just passing a Reader over the default stop word file into the StandardAnalyzer constructor.
While indexing my document using Lucene's StandardAnalyzer I ran into a problem.
For example:
My document had the word "plag-iarism". The analyzer indexed it as "plag" and "iarism", but I want "plagiarism". What do I have to do to get the whole word?
StandardAnalyzer delegates tokenization to StandardTokenizer.
You can create your own tokenizer to match your exact needs (you could base it on StandardTokenizer).
Alternatively, if you prefer, you could do a dirty hack with String.replaceAll() and the relevant regular expression, just before the analyzer runs. Yeah. Ugly.
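As a sketch of that second, hacky option (the regex and the Lucene 4+ field API here are assumptions; you would want to restrict the pattern to the hyphenation cases you actually care about):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class HyphenStripExample {
    public static void main(String[] args) {
        // Join hyphenated fragments before the text ever reaches StandardAnalyzer,
        // so "plag-iarism" is indexed as the single token "plagiarism".
        String raw = "plag-iarism";
        String cleaned = raw.replaceAll("(?<=\\w)-(?=\\w)", "");

        Document doc = new Document();
        doc.add(new TextField("content", cleaned, Field.Store.YES)); // field name is an assumption
        System.out.println(cleaned); // plagiarism
    }
}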
I have a question about the search process in Lucene.
I use this code for searching:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set "I want to go shopping" for a field, and in another document I've set "I wanna go shopping".
The meaning of both sentences is the same!
Is there any good way for Lucene to understand the meaning of sentences, or some way to normalize them? For example, saving the field as "I wanna /want to/ go shopping" and then removing the comment with a regexp in the results.
Lucene provides filters to normalize words and even map similar words.
PorterStemFilter -
Stemming allows words to be reduced to their roots.
e.g. wanted and wants would be reduced to the root want, and a search for any of those words would match the document.
However, wanna does not reduce to the root want, so it may not work in this case.
SynonymFilter -
would help you map similar words via a configuration file.
So wanna can be mapped to want, and if you search for either of them, the document will match.
You would need to add these filters to your analysis chain.
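The question uses Lucene.Net, but as a Java sketch against the Lucene 5.x/6.x analysis API (the synonym pair, the chain order, and the class name are assumptions), wiring both filters into a custom chain could look like this:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class NormalizingAnalyzer extends Analyzer {
    private final SynonymMap synonyms;

    public NormalizingAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Map the informal form onto the canonical one (keep the original too).
        builder.add(new CharsRef("wanna"), new CharsRef("want"), true);
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new SynonymFilter(stream, synonyms, true); // wanna -> want
        stream = new PorterStemFilter(stream);              // wanted, wants -> want
        return new TokenStreamComponents(tokenizer, stream);
    }
}

Use the same analyzer at both index and query time so that both forms are normalized identically.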
I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So, for example, if I have a document with the string "The big red fox", I am trying to get "bag fix" to match it.
The problem is, it seems that in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting it with:
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first Replace is there because Lucene.Net throws an exception if the search string already contains a ~; apparently it can't handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to default all words to allow for fuzziness?
You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how they do it; you will want to download the source and look at how it is implemented.
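If you go the n-gram route, a bare-bones Java sketch of an analyzer that indexes character 2- and 3-grams (in the spirit of what CJKAnalyzer does for CJK text; the gram sizes and the Lucene 5.x/6.x API are assumptions) could look like this:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Index overlapping character 2- and 3-grams of the text, so a query
        // only needs to share some fragments with a term to score against it.
        Tokenizer tokenizer = new NGramTokenizer(2, 3);
        TokenStream stream = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

The same analyzer would need to be applied at query time so the query terms are broken into the same grams.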
I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' at every space in the query. I just wanted to avoid having to detect grouped terms (brackets, quotes, etc.) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just pre-parse the user's search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene? Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, as described here, that covers everything you want except for + being the default, but that's something you can simply handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax), and then you can write the transformation from your own syntax to the Lucene syntax.
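A naive sketch of that space-to-'+' transformation (it breaks down as soon as the input contains quoted phrases or parentheses, which is exactly the concern raised in the question above):

public class RequireAllTerms {
    public static void main(String[] args) {
        // Prefix every whitespace-separated term with '+' so QueryParser treats
        // each one as required. This does NOT respect quotes or grouping.
        String userInput = "abc xyz";
        String luceneQuery = "+" + userInput.trim().replaceAll("\\s+", " +");
        System.out.println(luceneQuery); // +abc +xyz
    }
}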
The behavior is hard-coded in the method addClause(List, int, int, Query) of the class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
    throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.