Lucene: Accessing payloads of results of a query - lucene

When I'm searching for a query in Lucene, I receive a list of documents as result. But how can I get the hits within those documents? I want to access the payload of those word, which where found by the query.
If your query contains only one term you can simply use TermPositions to access the payload of this term. But if you have a more complex query with Phrase Search, Proximity Search, ... you can't just search for the single terms in TermPositions.
I would like to receive a List<Token>, TokenStream or something similiar, which contains all the Tokens that were found by the query. Then I can iterate over the list and access the payload of each Token.

I solved my problem by using SpanQueries. Nearly every Query can be expressed as SpanQuery. A SpanQuery gives access to the spans where the hit within a document is. Because the normal QueryParser doesn`t produce SpanQueries, I had to write my own parser which only creates SpanQueries. Another option would be the SurroundParser from Lucene-Contrib, which also creates SpanQueries.

I think you'll want to start by looking at the Lucene Highlighter, as it highlights the matching terms in the document.

Related

Is there a way to properly experiment with Solr field-types?

I'm working with Solr for a basic search engine, and I've created a couple different fieldTypes that include various filters and tokenizers in their analyzer chains.
However, I'm finding it very difficult to assess how these components of the chain interact and when I query in the Solr Admin, I consistently get different results than I expect-- with no clue as to why.
Is there a way to see what a phrase like education:"x university" is being transformed into when I type it in the q section of the Admin?
Also, when the phrase goes through the chain can it be transformed into multiple things that are all searched or is it just a single modified phrase?
Thanks for any help!
Use Analysis in Solr Admin to check how each field and its type process the tokens both while querying and indexing.
Analyse Fieldname / FieldType:
from the drop down option select field/type that you want to analyse and clieck on Analyse values.
ex: what tokenizer used, which all filter classes applied to token and how token is transformed after passing each filter class.
if
Verbose Output is checked, it shows more details about each filter class used for the selected field/type.

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (ie, "firstterm secondterm"), I want the resulting query to be an OR query: Bring me back results where fullNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query is looking for precisely that phrase, with the two terms appearing consecutively, and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder using in PhraseQuery.toString() to represent a removed stop word (See #Lucene-1396). You are still performing a phrasequery, but now it is expecting a three word phrase firstterm, followed by a removed stop word, followed by secondterm
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
Will search for any document with either of those terms (with higher score given to documents with both).

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
User can enter query to multiple fields i.e: +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and * i.e: +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy i.e: +NAME:Jahn~0.5
Now, I want to find, which field(s) contains my search term(s). As I am using wildcard and fuzzy, I cannot just make string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.

How to query neo4j lucene index with field grouping via REST?

We have data in our graph that is indexed by Lucene and need to query it with a
Field Grouping
The example in the link above shows the Lucene syntax to be:
title:(+return +"pink panther")
I can't figure out how to send a request like that via http to the REST interface. Specifically, the space in the second term is causing me problems.
Our data is actually a list and I need to match on multiple items:
regions:(+asia +"north america")
Anyone have any ideas?
Update: For the record, the following url encoded string works for this particular query:
regions%3A%28%2Basia+%2B%22north+america%22%29
Isn't it enough to just URL encode the query using java.net.URLEncoder or something?

How do i include other fields in a lucene search?

Lets use emails for an example as a document. You have your subject, body, the person who its from and lets say we can also tag them (as gmail does)
From my understanding of QueryParser i give it ONE field and the parser type. If a user enter text the user only searches whatever i set. I notice it will look in the subject or body field if i wrote fieldName: text to search however how do i make a regular query such as "funny SO question unicorn" find result(s) with some of those strings in the subject, the others in the body? ATM because i knew it would be easy i made a field called ALL and combined all the other fields into that but i would like to know how i can do it in a proper way. Especially since my next app is text search dependent
Use MultiFieldQueryParser. You can specify list of fields to be searched using following constructor.
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This will generate a query as if you have created multiple queries on different fields. This partially addresses your problem. This, still, will not match one term matching in field1 and another matching in field2. For this, as you have rightly pointed out, you will need to combine all the fields in one single field and search in that field. Nevertheless, you will find MultiFieldQueryParser useful when query terms do not cross the field boundaries.