I am trying to figure out how Lucene's analyzers work.
My question is how Lucene handles synonyms. Here is the situation:
We have single-word and multi-word synonyms:
single word: foo = bar
multi-word: foo bar = foobar
For single words:
Does Lucene expand the indexed records or not? I guess that if a query contains a word like "foo", it adds "bar" to the query too. I don't know whether the same happens at indexing time.
For multi-word synonyms:
Does Lucene expand both the query and the index? For example, if we have "foo bar", does it add "foobar" to the index and/or the query?
My second question: Lucene passes a stream of tokens through filters such as the lowercase filter. How does Lucene detect multi-word synonyms? That is, how does it find out that "foo bar" is a multi-word unit whose tokens belong together?
Thanks
SynonymFilter can optionally keep the original word and add the synonym to the TokenStream as well, by setting keepOrig=true (see SynonymMap.Builder.add()). This behavior can cause problems for PhraseQueries and the like; see the first note in the SynonymFilter docs.
If you are using the same Analyzer for querying and indexing, then both queries and documents written to the index will, of course, be treated the same way. SynonymFilter with keepOrig set to true is one of the few filters that is reasonably often applied differently at query time and at index time, but that is entirely up to your implementation.
As far as how it is implemented, the source code is available to you.
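To make the keepOrig behaviour and the multi-word matching more concrete, here is a minimal plain-Java sketch. It is not Lucene's implementation and uses no Lucene classes: the RULES table, the expand method and the greedy longest-match strategy are all assumptions made for illustration. It also flattens everything into one list, whereas a real token stream would carry position information for the injected synonym.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SynonymSketch {
    // Hypothetical rule table: key = input token sequence, value = synonym token.
    static final Map<List<String>, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Arrays.asList("foo", "bar"), "foobar"); // multi-word rule
        RULES.put(Arrays.asList("foo"), "bar");           // single-word rule
    }

    // Walks the token list and, at each position, applies the longest rule
    // whose key matches the upcoming tokens (a lookahead over the stream).
    static List<String> expand(List<String> tokens, boolean keepOrig) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            List<String> best = null;
            for (List<String> key : RULES.keySet()) {
                int end = i + key.size();
                if (end <= tokens.size()
                        && tokens.subList(i, end).equals(key)
                        && (best == null || key.size() > best.size())) {
                    best = key;
                }
            }
            if (best == null) {
                out.add(tokens.get(i)); // no rule matches: pass the token through
                i++;
            } else {
                if (keepOrig) {
                    out.addAll(best);   // keep the original tokens as well
                }
                out.add(RULES.get(best));
                i += best.size();       // skip past the matched tokens
            }
        }
        return out;
    }
}
```

With keepOrig=true the original tokens stay in the output next to the synonym; with keepOrig=false only the synonym survives. Whether you run this kind of expansion at index time, query time or both depends on which Analyzer you hand to the IndexWriter and to the query parser.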
Related
I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but it is doing an exact search for stemmed terms against an index that contains unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses of type PrefixQuery, so that it outputs a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I just add a wildcard to the end of the term and feed that into the parser, the stemmer doesn't kick in, i.e. it builds a query that looks for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
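To see why query-time-only stemming finds nothing, here is a toy plain-Java sketch. The toyStem method is a made-up stand-in, not the Porter stemmer, and the "index" is just a set of terms; it only shows that the terms produced at index time and at query time have to agree.

```java
import java.util.HashSet;
import java.util.Set;

public class StemMismatchSketch {
    // Toy stand-in for a stemmer (NOT the Porter algorithm): lowercases the
    // word and strips a trailing "es", "s" or "e".
    static String toyStem(String word) {
        String s = word.toLowerCase();
        if (s.endsWith("es")) {
            return s.substring(0, s.length() - 2);
        }
        if (s.endsWith("s") || s.endsWith("e")) {
            return s.substring(0, s.length() - 1);
        }
        return s;
    }

    // Builds the set of indexed terms, with or without stemming.
    static Set<String> index(String text, boolean stem) {
        Set<String> terms = new HashSet<>();
        for (String word : text.split("\\s+")) {
            terms.add(stem ? toyStem(word) : word.toLowerCase());
        }
        return terms;
    }
}
```

An index built without stemming holds "figure", so the stemmed query term "figur" misses it. After reindexing with the same stemming step, both "Figure" and "Figures" are stored as "figur" and the query matches.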
Our client uses a few acronyms on its site.
For example, let's say STACK is an acronym they are using.
When they are searching for "STACK" (keyword), they want documents that match "STACK" exactly (uppercase) to be on top of the search results, instead of documents matching "stack" lowercase.
Is there a way to achieve this? Maybe through query boosting somehow?
I'm using the StandardAnalyzer at the moment.
From the docs:
public final class StandardAnalyzer extends StopwordAnalyzerBase
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and
StopFilter, using a list of English stop words.
So there is no difference between 'STACK' and 'stack' in the index. You could add the keywords again as a StringField that exactly matches the keywords you search for, and boost that keyword field.
It is really hard to make any judgements about this one case (STACK vs stack). If all your acronyms are upper case, just exclude LowerCaseFilter from the analyzer chain. If some of your acronyms can contain dots or dashes (e.g. Y.M.C.A.), you probably need to use WhitespaceAnalyzer (instead of StandardAnalyzer) to ensure these are not split into multiple terms.
To me, boosting sounds superfluous here. If somebody enters a query closely matching the acronym, the relevant document will be ranked high anyway because of its similarity.
I'm working with Lucene 4.6 and I'm trying to look for records that contain "keyword1" in "field1" and "keyword2" in "field2".
I wrote the following query:
Query q = MultiFieldQueryParser.parse(
    Version.LUCENE_46,
    new String[] {keyword1, keyword2},   // one query string per field
    new String[] {"field1", "field2"},
    new StandardAnalyzer(Version.LUCENE_46)
);
That gives me some results, but I want to have something like %keyword1%, %keyword2% in SQL.
Thanks for your answers. In case I have a field with the value "Lucene Game Lucene" and I'm looking for that document using the keyword "Game", I can't get that result with either form of the keyword query. Does anyone have any idea about this?
You can use WildcardQuery. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. \ is the escape character.
You can also use the wildcard as a prefix, for example *nix, but that can be very slow on large indexes, because Lucene needs to scan the entire list of terms.
[edit]
If you need a prefix wildcard in the query parser, make sure to call setAllowLeadingWildcard(true) on the QueryParser.
WildcardQuery in Lucene makes it possible to search for keyword%. For the other way around, there is some work to be done during indexing: you need to index the terms in reversed form (in another field) and then perform the query drowyek%.
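The reversed-term idea can be sketched in plain Java without any Lucene classes. The class and method names below are made up for illustration, and a sorted TreeSet stands in for the term dictionary of the extra field: a suffix search becomes a cheap prefix scan over the reversed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ReversedTermSketch {
    // Sorted set of reversed terms; plays the role of the extra "reversed" field.
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Index time: store each term in reversed form.
    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    // Query time: "%suffix" becomes a prefix scan for reverse(suffix).
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        for (String candidate : reversedTerms.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix)) {
                break; // sorted order: no later entry can share the prefix
            }
            hits.add(reverse(candidate));
        }
        return hits;
    }
}
```

This covers the %keyword case only. A true %keyword% match still needs wildcards on both sides (or a different indexing scheme such as n-grams), so treat it as one piece of the puzzle.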
I've read that EXCEPT is a boolean operator for queries in ISYS (an enterprise search engine). It has the following functionality:
For the query First EXCEPT Second, the retrieved documents must contain the first search term, but only if the second term is not in the same paragraph as the first. Both terms can appear in the document, just not in the same paragraph.
Now how do I achieve this in Lucene?
Thank you :)
A rough outline of an implementation strategy would be to:
tokenize your input on paragraphs
index each paragraph separately, with a field referring to a common document identifier
use a BooleanQuery to construct a query that takes advantage of the above construction
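The outline above can be illustrated in plain Java, without Lucene. This is only a sketch of the per-paragraph logic: the matches method is hypothetical, paragraphs are assumed to be separated by blank lines, and a document is assumed to qualify as soon as one paragraph contains the first term without the second (which is what a required clause plus a prohibited clause over per-paragraph entries would give you).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExceptSketch {
    // Returns true if some paragraph contains `first` but not `second`.
    // Assumption: paragraphs are separated by one or more blank lines.
    static boolean matches(String document, String first, String second) {
        String wantedTerm = first.toLowerCase();
        String excludedTerm = second.toLowerCase();
        for (String paragraph : document.split("\\n\\s*\\n")) {
            // Crude tokenization: lowercase and split on non-word characters.
            Set<String> words = new HashSet<>(
                    Arrays.asList(paragraph.toLowerCase().split("\\W+")));
            if (words.contains(wantedTerm) && !words.contains(excludedTerm)) {
                return true;
            }
        }
        return false;
    }
}
```

In Lucene terms, each paragraph would be its own indexed entry carrying the parent document's identifier, and the hits would then be grouped back by that identifier.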
I have a "description" field indexed in Lucene.This field contains a book's description.
How do I achieve "All of these words" functionality on this field using the BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
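If you go the query-syntax route, building the "+term1 +term2 ..." string from the user's input is simple. Here is a minimal plain-Java sketch; requireAll is a hypothetical helper, it splits on whitespace only, and it does not escape query-parser special characters, which real user input would need.

```java
public class AllWordsQuerySketch {
    // Prefixes every whitespace-separated term with '+', so that each term
    // becomes a required clause in the classic query syntax.
    // Assumption: the input contains no query-parser special characters.
    static String requireAll(String userInput) {
        StringBuilder query = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only for blank input
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(term);
        }
        return query.toString();
    }
}
```

For "top selling book" this yields "+top +selling +book", which you can then hand to a query parser configured with the same analyzer you used at index time.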