Examine lucene.net custom query after analyzer tokenizes - lucene

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (ie, "firstterm secondterm"), I want the resulting query to be an OR query: Bring me back results where fullNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?

I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query is looking for precisely that phrase, with the two terms appearing consecutively, and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder using in PhraseQuery.toString() to represent a removed stop word (See #Lucene-1396). You are still performing a phrasequery, but now it is expecting a three word phrase firstterm, followed by a removed stop word, followed by secondterm
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
Will search for any document with either of those terms (with higher score given to documents with both).

Related

Operator LIKE in LUCENE

I'm working in Lucene 4.6 and i'm trying to look for records that contains "keyword1" in "field1" and "keyword2" in "field2"
I wrote following query:
Query q = MultifieldQueryParser.parse(
Version.Lucene_46,
new String[] {keyword1, keyword2},
new String[]{"field1","field2"},
new StandardAnalyzer()
);
That gives me some results but I want to have something like %keyword1% , %keyword2% in SQL.
Thanks for your answers. In case I have a field with the value "Lucene Game Lucene" and I'm looking for that document using the keyword "Game" I can't get that result using keyword neither keyword Who have any idea about this?
You can use WildcardQuery. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. \ is the escape character.
You can also use the wildcard as prefix, for example *nix, but that can very slow on large indexes, because Lucene needs to scan the entire list of Terms.
[edit]
If you need a prefix wildcard in the queryparser, make sure to call setAllowLeadingWildcard(true)
on the QueryParser As can be seen here
WildcardQuery in Lucene provides the possibility to search for keyword%. For the other way arount there is some work to be done during indexing. You need to index the terms in reversed form (in an other field) and perform the query drowyek%.

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
User can enter query to multiple fields i.e: +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and * i.e: +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy i.e: +NAME:Jahn~0.5
Now, I want to find, which field(s) contains my search term(s). As I am using wildcard and fuzzy, I cannot just make string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.

Lucene: Accessing payloads of results of a query

When I'm searching for a query in Lucene, I receive a list of documents as result. But how can I get the hits within those documents? I want to access the payload of those word, which where found by the query.
If your query contains only one term you can simply use TermPositions to access the payload of this term. But if you have a more complex query with Phrase Search, Proximity Search, ... you can't just search for the single terms in TermPositions.
I would like to receive a List<Token>, TokenStream or something similiar, which contains all the Tokens that were found by the query. Then I can iterate over the list and access the payload of each Token.
I solved my problem by using SpanQueries. Nearly every Query can be expressed as SpanQuery. A SpanQuery gives access to the spans where the hit within a document is. Because the normal QueryParser doesn`t produce SpanQueries, I had to write my own parser which only creates SpanQueries. Another option would be the SurroundParser from Lucene-Contrib, which also creates SpanQueries.
I think you'll want to start by looking at the Lucene Highlighter, as it highlights the matching terms in the document.

How to implement EXCEPT boolean operator of ISYS using Lucene API

I've studied that EXCEPT is a boolean operator for queries in ISYS(which is an Enterprise search Engine).It has the following functionality.
If this is the query First EXCEPT Second------>The retrieved documents must contain the first search term but only if the second term is not in the same paragraph as the first. Both terms can appear in the document; just not in the same paragraph.
Now how do I achieve this in Lucene?
Thank you :)
A rough outline of an implementation strategy would be to:
tokenize your input on paragraphs
index each paragraph separately, with a field referring a common document identifier
use the BooleanQueryto construct a query that takes advantage of the above construction

Search literal within a word

Is there a way to perform a FULLTEXT search which returns literals found within words?
I have been using MATCH(col) AGAINST('+literal*' IN BOOLEAN MODE) but it fails if the text is like:
blah,blah,literal,blah
blahliteralblah
blah,blah,literal
Please Note that there is no space after commas.
I want all three cases above to be returned.
I think that should be better fetching the array of entries and then perform a text manipulation over the fetched data (in this case a search)!
Because any text manipulation or complex query take more resources and if your database contains a lot of data, the query become too slow! Moreover, if you are running your
query on a shared server, that increases the performance issues!
You can easily accomplish what you are trying to do with regex, once you have fetched the data from the database!
UPDATE: My suggestion is the same even if you are running your script on a dedicated server! However, if you want to perform a full-text search of the word "literal" in BOOLEAN MODE like you have described, you can remove the + operator (because you are searching only one word) and construct the query as follow:
SELECT listOfColumsNames WHERE
MATCH (colName)
AGAINST ('literal*' IN BOOLEAN MODE);
However, even if you add the AND operator, your query works fine: tested on Apache Server with MySQL 5.1!
I suggest you to read the documentation about the full-text search in boolean mode.
The only one problem of this query is that doesn't matches the word "literal" if it is a sub-string inside an other word, for example: "textliteraltext".
As you noticed, you can't use the * operator at the beginning of the word!
So, to accomplish what you are trying to do, the fastest and easiest way is to follow the suggestion of Paul, using the % placeholder:
SELECT listOfColumsNames
WHERE colName LIKE '%literal%';