I have the following code and would appreciate your advice.
QueryParser queryParser = new QueryParser(searchTerm, analyzer);
Query query = queryParser.parse(searchTerm);
My first question: is this "doubled"? I pass the string to search for (searchTerm) both to the constructor and to the parse() method. Is this really required? (I need a Query object for further use.) Could doing it this way even introduce negative side effects?
I am also unable to specify the "default field" programmatically. In my queries I write "content:House" and it searches the field "content". How can I specify programmatically that "content" is my default field, so that a user only has to enter "House" and Lucene automatically searches the "content" field?
Thank you so much
jan
The first argument to the QueryParser constructor is the default search field, even if the javadoc doesn't make that obvious.
So you want this:
QueryParser queryParser = new QueryParser("content", analyzer);
Query query = queryParser.parse(searchTerm);
Related
I have a question about the search process in Lucene.
I use this code to search:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set a field to "I want to go shopping" and in another document I've set it to "I wanna go shopping".
Both sentences mean the same thing!
Is there a good way to make Lucene understand the meaning of sentences, or to normalize the sentences in some way? For example, I could save the field as "I wanna /want to/ go shopping" and remove the comment with a regexp in the result.
Lucene provides filters to normalize words and even to map similar words to each other.
PorterStemFilter:
Stemming reduces words to their roots.
For example, "wanted" and "wants" are reduced to the root "want", so a search for any of those words matches the document.
However, "wanna" does not reduce to the root "want", so stemming may not work in this case.
SynonymFilter:
This lets you map similar words to each other in a configuration file.
So "wanna" can be mapped to "want", and a search for either of them will match the document.
You would need to add these filters to your analysis chain.
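To make the synonym idea concrete, here is a minimal plain-Java sketch of what such a mapping step does to the two example sentences. It does not use Lucene at all: the class name SynonymNormalizer, the analyze method, and the one-entry synonym table are all made up for illustration, and a real analysis chain would do this inside a token filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an analysis chain: lowercase, split on whitespace, map synonyms.
class SynonymNormalizer {
    // Assumed synonym table; in a real setup this would come from the filter's configuration.
    private static final Map<String, String> SYNONYMS = Map.of("wanna", "want");

    // Returns the normalized tokens for the given text.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            // Replace the token with its canonical form if one is configured.
            tokens.add(SYNONYMS.getOrDefault(raw, raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("I want to go shopping"));
        System.out.println(analyze("I wanna go shopping"));
    }
}
```

After this step both sentences contain the token "want", so a search for "want" (run through the same normalization) can match either document. The remaining "to" in the first sentence is untouched; handling multi-word mappings is beyond this sketch.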
I am using Lucene to index my database and then perform a phrase search on a specific field (field name: keyword).
I am currently using the following code:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if (hits.length > 0) { /* code to retrieve hits */ }
This code works fine for stemming, but now I also want to expand my query to do a synonym search: if I enter "Man" and my Lucene index has an entry "male", it should still return that as a hit.
I tried adding this at Line 1 in the code above:
query = SynExpand.expand(userQuery, searcher, analyze, "keyword", serialVersionUID);
But it doesn't give me any results.
I also want to introduce spell checking, so that if I enter "ubelievable" instead of "unbelievable" I still get a result.
I have no idea why synonym expansion isn't working for me or how to do spell checking. If someone could guide me, I would be really grateful.
Thanks!
Fuzzy search can be done with a query term modifier, namely by appending a tilde:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
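Fuzzy matching of this kind is built on edit distance, i.e. how many single-character changes separate two terms. As a rough illustration only (this is a textbook Levenshtein implementation in plain Java, not Lucene's own code, and the class and method names are made up), the misspelling from the question is one edit away from the intended word:

```java
// Textbook Levenshtein distance using two rolling rows; illustrative only.
class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("ubelievable", "unbelievable")); // prints 1
    }
}
```

A single missing letter is a distance of 1, which is why a fuzzy term such as ubelievable~ can still reach "unbelievable".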
There are two ways of dealing with synonyms. The query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you must first run Syns2Index before you can use expansion. This is the easy way, but it works only with English words.
If you need to support multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that injects synonyms from your own dictionary into indexed documents. This may sound hard to implement, but fortunately there's an excellent example in the Lucene in Action book (the source code is available for free; see the lia.analysis.synonym package, though I highly recommend getting your own copy of this nice book).
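To show the idea of injection at indexing time without tying it to a particular Lucene version, here is a toy inverted index in plain Java. Everything in it (the class SynonymInjectingIndex, its methods, and the dictionary layout) is hypothetical and is not the book's code; it only illustrates that each indexed token is stored together with its synonyms, so a query for a synonym finds the document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index that injects synonyms while indexing; illustrative only.
class SynonymInjectingIndex {
    // Assumed dictionary: indexed token -> extra terms to index alongside it.
    private final Map<String, Set<String>> synonyms;
    // term -> ids of the documents that contain it (directly or via a synonym)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    SynonymInjectingIndex(Map<String, Set<String>> synonyms) {
        this.synonyms = synonyms;
    }

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            index(token, docId);
            // Inject every configured synonym as if it had occurred in the text.
            for (String synonym : synonyms.getOrDefault(token, Set.of())) {
                index(synonym, docId);
            }
        }
    }

    private void index(String term, int docId) {
        postings.computeIfAbsent(term, key -> new TreeSet<>()).add(docId);
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        SynonymInjectingIndex index =
                new SynonymInjectingIndex(Map.of("male", Set.of("man")));
        index.addDocument(1, "Male patient");
        index.addDocument(2, "Female patient");
        System.out.println(index.search("Man")); // prints [1]
    }
}
```

The trade-off of this approach is a larger index and the need to reindex when the dictionary changes; in exchange the query side stays untouched.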
I want to find all documents in the index that have a certain field, regardless of the field's value. If at all possible using the query language, not the API.
Is there a way?
If you know the type of data stored in your field, you can try a range query. For example, if your field contains string data, a query like field:[a* TO z*] would return all documents that have a string value in that field.
I've done some experimenting, and it seems the simplest way to achieve this is to create a QueryParser and call SetAllowLeadingWildcard( true ) and search for field:* like so:
var qp = new QueryParser( Lucene.Net.Util.Version.LUCENE_29, field, analyzer );
qp.SetAllowLeadingWildcard( true );
var query = qp.Parse( "*" );
(Note I am setting the default field of the QueryParser to field in its constructor, hence the search for just "*" in Parse()).
I cannot vouch for how efficient this method is over other methods, but being the simplest method I can find, I would expect it to be at least as efficient as field:[* TO *], and it avoids having to do hackish things like field:[0* TO z*], which may not account for all possible values, such as values starting with non-alphanumeric characters.
Another solution is using a ConstantScoreQuery with a FieldValueFilter
new ConstantScoreQuery(new FieldValueFilter("field"))
I am using MultiFieldQueryParser to parse strings like a.a., b.b., etc.
But after parsing, it removes the dots from the string.
What am I missing here?
Thanks.
I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
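As a rough sketch of that behaviour in plain Java (not the parser's real implementation): clauses that already name a field are left alone, and unqualified clauses are spread across every default field. The class MultiFieldExpansion, its methods, and the "author" field are assumptions for illustration, and the input is taken as already split into clauses to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates spreading unqualified clauses over several default fields.
class MultiFieldExpansion {
    // A clause counts as qualified if it starts with a field name and a colon.
    static boolean hasField(String clause) {
        return clause.matches("^\\w+:.*");
    }

    static List<String> expand(List<String> clauses, List<String> defaultFields) {
        List<String> expanded = new ArrayList<>();
        for (String clause : clauses) {
            if (hasField(clause)) {
                // Explicit field: searched only there.
                expanded.add(clause);
            } else {
                // No field given: try the clause against each default field.
                List<String> perField = new ArrayList<>();
                for (String field : defaultFields) {
                    perField.add(field + ":" + clause);
                }
                expanded.add("(" + String.join(" ", perField) + ")");
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        List<String> clauses = List.of("title:\"Of Mice and Men\"", "\"John Steinbeck\"");
        System.out.println(expand(clauses, List.of("title", "author")));
    }
}
```

With default fields title and author, the phrase "John Steinbeck" becomes a group that checks both fields, while the title clause stays as written.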
What analyzer is your parser using? If it's StopAnalyzer then the dot could be a stop word and is thus ignored. Same thing if it's StandardAnalyzer which cleans up input (includes removing dots).
(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user's search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene? Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, as described here, that covers everything you want except for + being the default, but that's something you can simply handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax), and then you can write the transformations from your own syntax to the Lucene syntax.
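If you do go the preparsing route, blindly replacing spaces breaks quoted phrases and bracketed groups, which is the concern raised in the question. A minimal plain-Java sketch of a more careful rewrite is below; the class RequireAllTerms and its methods are made up for illustration, and it deliberately ignores escapes, nested quotes, and the AND/OR/NOT keywords, so treat it as a starting point rather than a complete preprocessor.

```java
import java.util.ArrayList;
import java.util.List;

// Prefixes '+' to each top-level term, leaving phrases and groups intact.
class RequireAllTerms {
    // Splits on whitespace that is outside double quotes and parentheses.
    static List<String> splitTopLevel(String query) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int depth = 0;
        for (char c : query.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')' && depth > 0) {
                depth--;
            }
            if (Character.isWhitespace(c) && !inQuotes && depth == 0) {
                // A top-level space ends the current term.
                if (current.length() > 0) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    // Adds '+' to every term that does not already start with '+' or '-'.
    static String requireAll(String query) {
        List<String> out = new ArrayList<>();
        for (String part : splitTopLevel(query)) {
            boolean hasPrefix = part.startsWith("+") || part.startsWith("-");
            out.add(hasPrefix ? part : "+" + part);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(requireAll("lucene -solr \"abc xyz\" (a b)"));
    }
}
```

For the input above this yields +lucene -solr +"abc xyz" +(a b). As the other answers point out, setDefaultOperator achieves the same effect without touching the query string, so this is mainly useful when you need a custom input syntax.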
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.
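The branch above can be restated as a small standalone function, which makes the effect of that change easier to see. This is a simplified plain-Java sketch, not the actual QueryParser source: the class ClauseOccurrence, its local Occur enum, and the defaultOccur parameter are hypothetical and stand in for the hard-coded SHOULD.

```java
// Simplified restatement of the required/prohibited decision; illustrative only.
class ClauseOccurrence {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // defaultOccur is what a clause gets when it has neither '+' nor '-'.
    static Occur occurFor(boolean required, boolean prohibited, Occur defaultOccur) {
        if (required && prohibited) {
            throw new IllegalArgumentException(
                    "Clause cannot be both required and prohibited");
        }
        if (prohibited) {
            return Occur.MUST_NOT;
        }
        if (required) {
            return Occur.MUST;
        }
        return defaultOccur;
    }

    public static void main(String[] args) {
        // A plain term under the two possible defaults.
        System.out.println(occurFor(false, false, Occur.SHOULD)); // prints SHOULD
        System.out.println(occurFor(false, false, Occur.MUST));   // prints MUST
    }
}
```

Passing MUST as the default corresponds to the edit described above: plain terms become required, while '-' terms are still excluded.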