How to implement EXCEPT boolean operator of ISYS using Lucene API - lucene

I've studied that EXCEPT is a boolean operator for queries in ISYS(which is an Enterprise search Engine).It has the following functionality.
If this is the query First EXCEPT Second------>The retrieved documents must contain the first search term but only if the second term is not in the same paragraph as the first. Both terms can appear in the document; just not in the same paragraph.
Now how do I achieve this in Lucene?
Thank you :)

A rough outline of an implementation strategy would be to:
tokenize your input on paragraphs
index each paragraph separately, with a field referring a common document identifier
use the BooleanQueryto construct a query that takes advantage of the above construction

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what Term is?What is the difference between term and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

Lucene boost case sensitive match

Our client uses a few acronyms on its site.
For example, let's say STACK is an acronym they are using.
When they are searching for "STACK" (keyword), they want documents that match "STACK" exactly (uppercase) to be on top of the search results, instead of documents matching "stack" lowercase.
Is there a way to achieve this? Maybe through query boosting somehow?
I'm using the StandardAnalyzer at the moment.
From the docs:
public final class StandardAnalyzer extends StopwordAnalyzerBase
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and
StopFilter, using a list of English stop words.
So you have no difference between 'STACK' and 'stack'. You could add the keywords again as StringField that exactly matches the keywords you search for and boost the keyword field.
It is really hard to make any judgements about this one case (STACK vs stack). If all your acronyms are upper case, just exclude LowerCaseFilter from the analyzer chain. If some of your acronyms can contain dots or dashes (e.g. Y.M.C.A.), you probably need to use WhitespaceAnalyzer (instead of StandardAnalyzer) to ensure these are not split into multiple terms.
To me, boosting sounds superfluous here. Say if somebody enters a query closely matching the acronym, the relevant document will be ranked high anyway because of its similarity.

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (ie, "firstterm secondterm"), I want the resulting query to be an OR query: Bring me back results where fullNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query is looking for precisely that phrase, with the two terms appearing consecutively, and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder using in PhraseQuery.toString() to represent a removed stop word (See #Lucene-1396). You are still performing a phrasequery, but now it is expecting a three word phrase firstterm, followed by a removed stop word, followed by secondterm
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
Will search for any document with either of those terms (with higher score given to documents with both).

Lucene Synonym Filter behavior

I am trying to figure out how does lucene's analyzer work?
My question is how does lucene handle synonym words? Here is the situation:
we have single words and multi words
single: foo = bar
multi words: foo bar = foobar
For single words:
Does lucene expand the indexed records or not? I guess if a query has a word like "foo", it adds "bar" to the query too. I don't know if it happens for indexing or not?
For multi words:
Does lucene expand both query and indexing? for example if we have "foo bar", does it add foobar to the indexing/query?
My second question is : Lucene uses a stream of tokens and gives them to the filters like lowercase filter. My question is how does lucene find the multi words? like how does it find out that "foo bar" is a multi words that are together?
thanks
SynonymFilter can, optionally, keep the original word, and add the synonym to the tokenstream as well, by setting keepOrig=true (see SynonymMap.Builder.add()). This behavior can cause problems for PhraseQueries and the like, see first Note on the SynonymFilter docs.
If you are using the same Analyzer for querying and indexing, then both queries and docs written to the index will, of course, be treated the same way. SynonymFilter with keepOrig set to true is one of the few Analyzers that is reasonably often applied incongruously between querying and indexing, but that is entirely up to your implementation.
As far as how it is implemented, the source code is available to you.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".