Retrieving per keyword/field match position in Lucene Solr -- possible? - lucene

Is there any way to retrieve the match field/position for each keyword for each matching document from solr?
For example, if the document has title "Retrieving per keyword/field match position in Lucene Solr -- possible?" and the query is "solr keyword", I'd like to get, in addition to the doc-id (I normally only want the doc-id, not the full document), something that can tell me the matches are at:
solr:
title: 9
keyword:
title: 3
I'm pretty sure this information is computed during query execution (for phrase queries), but is it possible to return it to the application?
Thanks!

Debugging Relevance Issues in Search suggests using Solr analysis, which you can get to from the admin URL, using something like http://localhost:8983/solr/admin/analysis.jsp?highlight=on .
This highlights matching terms and gives their position.

AFAIK there is no way to do that directly, but you can use hit highlighting to implement it.
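One way the highlighting workaround could look on the client side is sketched below in Python. It assumes the highlighter returns the whole title as a single fragment (for example with hl=true, hl.fl=title and hl.fragsize=0) and wraps matches in the default <em>...</em> markers. The function name and the regex tokenizer are made up for illustration; the tokenizer only stands in for the field's real analyzer, so positions can differ from Solr's internal ones if the analyzer drops or splits tokens differently.

```python
import re
from collections import defaultdict

def positions_from_highlight(snippet, pre="<em>", post="</em>"):
    """Map each highlighted term to its 1-based token positions in the snippet.

    Hypothetical helper: tokens are runs of word characters, which only
    approximates what the field's analyzer does at index time.
    """
    positions = defaultdict(list)
    marked = re.compile(re.escape(pre) + r"(.*?)" + re.escape(post))
    token_index = 0
    cursor = 0
    for match in marked.finditer(snippet):
        # Count the plain tokens between the previous highlight and this one.
        token_index += len(re.findall(r"\w+", snippet[cursor:match.start()]))
        # Every token inside the markers is a hit; record its position.
        for token in re.findall(r"\w+", match.group(1)):
            token_index += 1
            positions[token.lower()].append(token_index)
        cursor = match.end()
    return dict(positions)

highlighted_title = (
    "Retrieving per <em>keyword</em>/field match position "
    "in Lucene <em>Solr</em> -- possible?"
)
print(positions_from_highlight(highlighted_title))
```

Running this per highlighted field gives the term-to-position mapping asked for in the question, keyed by field on the caller's side.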

Related

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!
My previous answer was definitely mistaken. Internal wildcards do appear to be supported. Try:
title:rijkspe*on*
title:rijksper?on*
I'm fairly sure this is an analysis issue, most likely a stemming analyzer. I'm not all that familiar with Cloudant and its implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that your analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So, when you search for
rijkspersonee*
or
rijksper?oneel
you find no matches, since the "el" is missing from the end of the indexed form. A plain search for rijkspersoneel does get analyzed, though, so you search for the stemmed form of the word and find matches.
Stemming and wildcards just don't get along.
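To make the mismatch concrete, here is a small Python sketch. The stemmer is a made-up stand-in that only drops a trailing "el" (matching the guess above about the indexed form, not Cloudant's actual analyzer), and fnmatch plays the role of the wildcard matcher. Term queries pass through the stemmer, wildcard patterns are compared against the indexed term as-is.

```python
from fnmatch import fnmatchcase

def toy_stem(word):
    """Hypothetical stemmer: lowercase and drop a trailing 'el'."""
    word = word.lower()
    return word[:-2] if word.endswith("el") else word

# The index stores the analyzed (stemmed) form, not the original word.
indexed_terms = {toy_stem("rijkspersoneel")}  # {"rijkspersone"}

def term_query(text):
    """Term queries are analyzed, so the query text is stemmed too."""
    return toy_stem(text) in indexed_terms

def wildcard_query(pattern):
    """Wildcard queries skip analysis and match the raw indexed terms."""
    return any(fnmatchcase(term, pattern) for term in indexed_terms)

for pattern in ["rijkspers?neel", "rijkspersonee*", "rijkspe*on*", "rijksper?on*"]:
    print(pattern, wildcard_query(pattern))
print("rijkspersoneel", term_query("rijkspersoneel"))
```

The patterns that stop before the stripped suffix match, while the ones that spell out the full word do not.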

How do I get accurate search result in Lucene using Query syntax

So far I have been testing the keywords that I entered in Sitecore using the query syntax, but the search results do not rank the expected page first.
For example, if I use query syntax on the word book: (title:book)^1
I want the index page that is named book to appear first in the search results, not bookmark.
Also, every time I publish a new page in Sitecore, the results for the keyword Book get pushed down to the last position or don't appear on the search page at all.
How do I get accurate results from Lucene for the search page?
I've also been following http://www.lucenetutorial.com/lucene-query-syntax.html on how to boost search results, but it doesn't work.
Can someone explain how boosting a search term works?
I recommend you leverage the Advanced Database Crawler to get the best use of Lucene.NET with Sitecore. From that, there's a config file for the indexes with a section called <dynamicFields ... >. In that section, you can specify an individual Sitecore field and adjust the boost attribute. The default boost for every field is 1f, that is, the floating-point value 1.
More reading:
Sitecore Searcher and Advanced Database Crawler
Source code for the ADC
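On the question of how a boost works: roughly, a boost is a multiplier on the score contribution of the clause (or field) it is attached to. Since the default boost is 1, writing (title:book)^1 changes nothing. The Python sketch below is a deliberately simplified toy model (term frequency times boost), not Lucene's actual scoring formula, and the documents and field names are invented for illustration.

```python
def score(doc, clauses):
    """Toy score: sum of (term frequency * boost) over (field, term, boost) clauses."""
    total = 0.0
    for field, term, boost in clauses:
        tokens = doc.get(field, "").lower().split()
        total += boost * tokens.count(term)
    return total

# Hypothetical documents: one titled "book", one titled "bookmark".
book_page = {"title": "book", "body": "a page"}
bookmark_page = {"title": "bookmark", "body": "book book book"}

default_boost = [("title", "book", 1.0), ("body", "book", 1.0)]
title_boosted = [("title", "book", 5.0), ("body", "book", 1.0)]

for name, clauses in [("^1", default_boost), ("^5", title_boosted)]:
    print(name, score(book_page, clauses), score(bookmark_page, clauses))
```

In this toy model the title match only outranks the body matches once the title clause carries a boost greater than the default.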

How to find href=blah but not href=/blah with Full-text search

I'm currently using the query
SELECT Url FROM Link WHERE CONTAINS(Url, 'href=blah')
It is including results with href=/blah. Any way I can tell the query to act more like WHERE Url LIKE '%href=blah%' and still use the full-text catalog?
Your problem is that = and / are both word breakers; in other words, SQL full-text search is actually searching for href and blah.
There are a couple of options you could try. First you could filter down the search domain using the fulltext engine, then search the subset of data using LIKE. You'll need to experiment to see how to squeeze out the best performance.
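In SQL terms the first option would be something like WHERE CONTAINS(Url, 'href=blah') AND Url LIKE '%href=blah%'. The Python sketch below only models why that works: stage one mimics the word breaker by splitting on anything that is not a letter or digit (an assumption about its behaviour, simplified), and stage two applies the exact substring test to the survivors. The function names and sample rows are invented for illustration.

```python
import re

def word_break(text):
    """Split on non-alphanumerics, roughly like a full-text word breaker."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def fulltext_contains(url, phrase):
    """Stage 1: phrase match on word-broken tokens (punctuation is ignored)."""
    tokens = word_break(url)
    wanted = word_break(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

def search(urls, needle):
    """Stage 2: run the exact LIKE-style substring check on stage-1 hits only."""
    candidates = [u for u in urls if fulltext_contains(u, needle)]
    return [u for u in candidates if needle.lower() in u.lower()]

rows = ["<a href=blah>", "<a href=/blah>", "<a src=blah>"]
print([u for u in rows if fulltext_contains(u, "href=blah")])
print(search(rows, "href=blah"))
```

The full-text stage keeps both href rows because the slash is invisible to it; the substring stage then drops href=/blah.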
The other option is, if href=blah is a consistent term you could add that to a custom dictionary. A good article on this is here.

How to get the results which has all the strings specified in the search query

I am a beginner in Lucene. I am writing a search engine to search our code base for certain keywords. I have a requirement for which I need your help. Say I am searching for the phrase "Apple computers": I would like Lucene to return only the lines which contain "apple computers", case-insensitively. But what I see is lines containing Apple computers, lines containing only apple, and lines containing only computers. How do I filter the results to get only the lines containing both apple and computers?
How do you query Lucene? Basically what you are asking about is covered by building a query using BooleanClause.Occur.MUST.
Exactly how to do this is dependent on your query construction: For the default query parser you should use something like
+Apple +computers
If you are building queries programmatically, you should use MUST for every term.
As Yuval suggested, it's important to know how you use Lucene.
If you use it through lucene-java and need exact phrase results (docs that contain "apple computers" with the two words together), you can use PhraseQuery.
The example of how to compose it.
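The difference between the two suggestions (every term required versus an exact phrase) can be seen in a few lines of Python. This is only a model of the matching semantics, not Lucene code; the tokenizer is a simple lowercase regex split and the sample lines are invented.

```python
import re

def tokenize(line):
    """Lowercase and split into alphanumeric tokens (case-insensitive matching)."""
    return re.findall(r"[a-z0-9]+", line.lower())

def matches_all_terms(line, query):
    """Like '+Apple +computers': every query term must occur somewhere in the line."""
    tokens = set(tokenize(line))
    return all(term in tokens for term in tokenize(query))

def matches_phrase(line, query):
    """Like a phrase query: the terms must occur adjacent and in order."""
    tokens = tokenize(line)
    wanted = tokenize(query)
    n = len(wanted)
    return n > 0 and any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

lines = [
    "Apple computers are fast",
    "computers made by apple",
    "an apple a day",
    "old computers",
]
print([l for l in lines if matches_all_terms(l, "Apple computers")])
print([l for l in lines if matches_phrase(l, "Apple computers")])
```

Requiring every term keeps the first two lines; the phrase match keeps only the first.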

Getting exact matches in Lucene using the standard analyzer?

Given 2 documents with the content as follows
"I love Lucene"
"Lucene is nice"
I want to be able to query Lucene for only those documents with Lucene at the beginning, i.e., everything that matches the regexp "^Lucene .*".
Is there a way to do it, given that I can't change the index and it was analyzed using the standard analyzer?
Sure, take a look at SpanFirstQuery. Here is a good tutorial:
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
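In Lucene's Java API this would be roughly new SpanFirstQuery(new SpanTermQuery(new Term("content", "lucene")), 1), where "content" is a placeholder field name and the term is lowercased because the standard analyzer lowercases at index time. The Python sketch below only models the matching rule: a term at 0-based position i forms a span ending at i + 1, and the query accepts spans that end at or before the given limit.

```python
import re

def tokenize(text):
    """Lowercase and split into tokens, a rough stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def span_first(text, term, end):
    """True if `term` occurs as a span that ends at or before position `end`."""
    return any(
        token == term and position + 1 <= end
        for position, token in enumerate(tokenize(text))
    )

docs = ["I love Lucene", "Lucene is nice"]
print([d for d in docs if span_first(d, "lucene", 1)])
```

With end set to 1 only documents whose first token is the term match; a larger end would also accept the term a few positions in.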