In Luke, if I enter the search expression docfile:Tomatoes.jpg* the parsed query is docfile:Tomatoes.jpg*. When the search expression is docfile:Tomatoes.jpg, (no asterisk *) the parsed query is docfile:tomatoes.jpg with a lowercase 't'.
Why?
How can I change this?
BTW, using org.apache.lucene.analysis.standard.StandardAnalyzer.
StandardAnalyzer uses LowerCaseFilter which means it lowercases your queries and data. This is described in the Javadocs http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html.
If I remember correctly WhitespaceAnalyzer does not lowercase, but verify it suits your needs http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/WhitespaceAnalyzer.html.
For Lucene 5.3.0 the problem was solved by using the SimpleAnalyzer.
Example:
Analyzer analyzer = new org.apache.lucene.analysis.core.SimpleAnalyzer();
Finally, use the same analyzer for building the index and searching.
Related
While indexing my document using lucene Standard Analyzer I got a plroblem.
For example:
my document had a word "plag-iarism" ... here this analyzer indexed it as "plag" and "iarism". But I want like "plagiarism". What I have to do to get a whole word?
StandardAnalyzer delegates tokanization to StandardTokenizer.
You create your own tokanizer to match your exact needs (you could base it on StandardTokenizer).
Alternatively, if you prefer, you could do a dirty hack of a String.replace(), with the relevant regular expression, just the analyzer runs. Yeah. Ugly.
I was trying to use the Search module provided in Play framework, and ran into the restriction
"You cannot use a * or ? symbol as the first character of a search" based on the reference
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_2/queryparsersyntax.html
Is there any alternative to query for the substring keyword "ice" in String objects such as "nice, rice,.."
Thx
You can do this using a QueryParser by setting the setAllowLeadingWildcard flag.
QueryParser queryParser = new QueryParser(...);
queryParser.setAllowLeadingWildcard(true);
Query q = queryParser.parse("*ice");
But you should be aware of performance implications. From the docs:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by
default.
As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ).
Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that
match the pattern.
I am using the Lucene standard analyzer to parse text. however, it is returning prepositions as well as words like "i", "the", "and" etc...
Is there an Analyzer I can use that will not return these words?
Thanks
StandardAnalyzer uses StopFilter.
By default the words in the STOP_WORDS_SET are excluded. If this is not sufficient, there are constructors which allow you to pass in a list of stop words which should be removed from the token stream. You can provide the list using a File, a Set, or a Reader.
Am using MultiFieldQueryParser for parsing strings like a.a., b.b., etc
But after parsing, its removing the dots in the string.
What am i missing here?
Thanks.
I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
What analyzer is your parser using? If it's StopAnalyzer then the dot could be a stop word and is thus ignored. Same thing if it's StandardAnalyzer which cleans up input (includes removing dots).
(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to user an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be use to force a term to be included but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene. Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has a extensive query language as described here that describes everything you want except for + being the default but that's something you can simple handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise to adhere to the default Lucene syntax) and then you can write the transformations from your own syntax to the Lucene syntax.
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.