Lucene search or highlight field values in a document

Each indexed document has several fields, and each field has thousands of values. When a document is returned by a query, I also want to find out which values in each field produced the hit.
Currently, I run the highlighter over each field value in a loop:
highlighter.setTextFragmenter(new SimpleFragmenter(1000));
// Analyze the value and hand the same text to the highlighter.
final TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(fieldValue));
final CachingTokenFilter filter = new CachingTokenFilter(tokenStream);
final String fragment = highlighter.getBestFragment(filter, fieldValue);
if (fragment != null) {
    // TODO: this value contains a hit
}
It works, but the performance is poor.
Is there a better way to get the same result with better performance?
Thanks in advance.
More details:
Each document has ten fields, field-1, field-2, ... field-10, and each field has 1000 different values, such as field-1: apple, pear, banana, ...
I use MultiFieldQueryParser. When searching for "apple", for example, there are hits on field-1 and field-3. In the first step I get the expected documents, and in the second step I want to get field-1: "apple", field-3: "2 apples" for the top hit document.
For the second step, I loop through each value of field-1, which means running the highlighter 1000 times. I think there may be a shortcut, such as applying a query or filter to field-1 in the selected document and getting "apple" back, so that only one query is required.

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.
Do you know how this could be possible, without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";

// Re-use the analyzer that was used at index time.
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
// Ask the stream to expose each token's text and character offsets.
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        // The analyzer lower-cases tokens, so compare against a lower-case term.
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    // Release the stream even if tokenizing fails.
    ts.Dispose();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
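For readers on the Java side, the re-tokenizing idea can be sketched without any Lucene dependency. This is only an illustration: the class name TokenOffsets and its tokenizing rule (split on anything that is not a letter or digit, then lower-case) are assumptions that roughly imitate what StandardAnalyzer does on simple ASCII text. It is not the analyzer's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a toy tokenizer that reports offsets.
public class TokenOffsets {

    /** Returns {start, end} character offsets of every token equal to searchTerm, ignoring case. */
    public static List<int[]> find(String text, String searchTerm) {
        List<int[]> hits = new ArrayList<>();
        String needle = searchTerm.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < text.length()) {
            // Skip separator characters between tokens.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String token = text.substring(start, i).toLowerCase(Locale.ROOT);
                if (token.equals(needle)) {
                    hits.add(new int[] {start, i});
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int[] hit : find("Foo Bar Baz Bar Bat", "bar")) {
            System.out.println(" > offset: " + hit[0] + "-" + hit[1]);
        }
    }
}
```

Run against the same sample text, this reports the same two offset pairs as the output above (4-7 and 12-15). With a real index, the field's own analyzer should be used instead, so that tokens line up with what was indexed.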

Get the position of matches in Lucene

Is it possible to find the position of words with a match when the indexed field isn't stored?
for example:
Query: "fox over dog"
Indexed text of matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note1: I know text can be highlighted using Lucene but I want the position of the words
Note2: The field isn't set to be stored by Lucene
I have not done this in practice, so treat the following as pseudocode and pointers that you can experiment with to reach a correct solution.
Also, you have not specified your Lucene version; I am using Lucene 6.0.0 with Java.
1. While indexing, set these two booleans on the field for which positions are desired. Lucene can only return position data if it was stored at index time.
FieldType txtFieldType = new FieldType(
TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
2. At search time, use Terms, TermsEnum and PostingsEnum as below:
Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");
// terms is null when the document has no term vector for this field.
if (terms != null && terms.hasPositions()) {
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        postings = termsEnum.postings(postings, PostingsEnum.ALL);
        while (postings.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
            // A term can occur several times in the document, so read every position.
            for (int i = 0; i < postings.freq(); i++) {
                System.out.println(term.utf8ToString() + ": " + postings.nextPosition());
            }
        }
    }
}
You will need to do some analysis of your own to arrive at the data you need, but you first need to save the metadata as described in point 1.
searcher is an IndexSearcher instance, hit is a ScoreDoc, and hit.doc is the document id.
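The "analysis of your own" step might look like the following plain-Java sketch. It assumes the term vector loop has already been used to collect a map from each term to its positions. The class MatchPositions, the map, and the sample data are all hypothetical. It also assumes positions are 0-based, so 1 is added to get the numbering used in the question, and that the query terms have been analyzed (for example lower-cased) the same way as the indexed text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns per-term positions into the sorted list of match positions.
public class MatchPositions {

    /** Collects the 1-based positions of all query terms, in ascending order. */
    public static List<Integer> positionsOf(Map<String, int[]> termPositions, List<String> queryTerms) {
        List<Integer> result = new ArrayList<>();
        for (String term : queryTerms) {
            int[] positions = termPositions.get(term);
            if (positions == null) {
                continue; // the term does not occur in this document
            }
            for (int p : positions) {
                result.add(p + 1); // assumed 0-based input, 1-based output
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Sample data standing in for what the TermsEnum loop would yield for
        // "The quick brown fox jumps over the lazy dog".
        Map<String, int[]> termPositions = new HashMap<>();
        termPositions.put("the", new int[] {0, 6});
        termPositions.put("quick", new int[] {1});
        termPositions.put("brown", new int[] {2});
        termPositions.put("fox", new int[] {3});
        termPositions.put("jumps", new int[] {4});
        termPositions.put("over", new int[] {5});
        termPositions.put("lazy", new int[] {7});
        termPositions.put("dog", new int[] {8});

        System.out.println(positionsOf(termPositions, Arrays.asList("fox", "over", "dog"))); // [4, 6, 9]
    }
}
```

Because only positions are read from the term vector, this works even though the field itself is not stored.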

Search by exact words in a phrase using Umbraco Examine

I have some description field per content and those are:
For content1:
The quick brown fox jumps over the lazy dog. And the lazy dog is good.
For content2:
The lazy fog is crazy.
Now, when I search with keyword = lazy dog, I want to get content1 as the result and not content2.
I tried this:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog" )
.Compile();
ISearchResults result = searcher.Search( criteria );
But it didn't give me the desired results; it returns both content1 and content2.
What should I do to get only content1 as the result?
By default, Examine compiles this query to:
+(+description:lazy dog)
so the two words are matched separately rather than as a phrase, which is why both content items come back.
What you want to achieve is:
+(+description:"lazy dog")
The first thing to try is escaping the phrase. In your case it will be:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog".Escape() )
.Compile();
ISearchResults result = searcher.Search( criteria );
I can't test it right now, but from what I remember there were some problems with it in the past. The second option, which may be a life saver for you, is to build the search query manually and run it as a raw query.
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria = searcher.CreateSearchCriteria();
var query = criteria.RawQuery("+description:\"lazy dog\"");
ISearchResults result = searcher.Search( query );
This should return only the correctly matched result. Personally, I have also boosted specific words to push some results higher in the score list, but if you only want matched items, try the solutions above and let me know if they helped.
If you want to search more than one property, you can either use fluent API methods like GroupedAnd or GroupedOr (depending on the desired search behaviour) or build a more advanced raw query.
For the first option, check the Grouped Operations documentation: https://github.com/Shazwazza/Examine/wiki/Grouped-Operations.
For the second scenario, it is best to look at how it is done elsewhere, e.g. in the ezSearch package (which, by the way, is awesome!): https://github.com/umco/umbraco-ezsearch/blob/master/Src/Our.Umbraco.ezSearch/Web/UI/Views/MacroPartials/ezSearch.cshtml.

Getting the last item in Sitecore content data

I am performing a search in which I have to get the 'ID' field of the last item stored in sitecore/content/data/MyItem. There are more than 1000 items in this folder. I know Lucene search is very efficient, so I performed a Lucene search to get items by field value like this:
using (IndexSearchContext searchContext = indx.CreateSearchContext())
{
var db = Sitecore.Context.Database;
CombinedQuery query = new CombinedQuery();
QueryBase catQuery = new FieldQuery("country", guidValueToSearch); //FieldName, FieldValue.
SearchHits results = searchContext.Search(catQuery); //Searching the content items by fields.
SearchResultCollection result = results.FetchResults(0, numOfArticles);
Here I pass guidValueToSearch as the "country" field value of the items to fetch. But I want to get the last item in the folder. How can I achieve this?
If you know you need the last child item of /sitecore/content/data/MyItem, you could also use a simpler approach: get the parent item and then retrieve its last child (Last() is a LINQ extension method, so System.Linq must be imported):
Item myItem = Sitecore.Context.Database.GetItem("/sitecore/content/data/MyItem");
Item lastItem = myItem.Children.Last();
The same could be done with Descendants instead of Children.
If you did want to implement it using search then have a look at this answer which explains how to extend the IndexSearchContext to have methods that accept a Lucene.Net.Search.Sort. You can then pass in the Sitecore.Search.BuiltinFields.Created or Sitecore.Search.BuiltinFields.Updated field (depending on what you are after).

emit every document in the database with lucene

I've got an index where I need to get all documents with a standard search, still ranked by relevance, even if a document isn't a hit.
My first idea is to add a field that is always matched, but that might deform the relevance score.
Use a BooleanQuery to combine your original query with a MatchAllDocsQuery. You can mitigate the effect this has on scoring by setting the boost on the MatchAllDocsQuery to zero before you combine it with your main query. This way you don't have to add an otherwise bogus field to the index.
For example:
// Parse a query by the user.
QueryParser qp = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer(Version.LUCENE_35));
Query standardQuery = qp.parse("User query may go here");
// Make a query that matches everything, but has no boost.
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
matchAllDocsQuery.setBoost(0f);
// Combine the queries.
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(standardQuery, BooleanClause.Occur.SHOULD);
boolQuery.add(matchAllDocsQuery, BooleanClause.Occur.SHOULD);
// Now just pass it to the searcher.
This should give you hits from standardQuery followed by the rest of the documents in the index.
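To make the expected ordering concrete, here is a toy model in plain Java with no Lucene dependency. It assumes each document's final score is simply its main-query score (0 if it did not match) plus the match-all score multiplied by a boost of zero. Real Lucene scoring involves further factors, so this only illustrates the ordering. The class EmitAllDocs and the sample scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "main query SHOULD + zero-boost match-all SHOULD"; not Lucene's real scoring.
public class EmitAllDocs {

    /** Returns every doc id: main-query hits first by descending score, then the remaining docs. */
    public static List<String> rankAll(List<String> allDocs, Map<String, Double> mainScores) {
        final double matchAllBoost = 0.0; // mirrors matchAllDocsQuery.setBoost(0f)
        final double matchAllScore = 1.0; // assumed constant score of the match-all clause
        List<String> ranked = new ArrayList<>(allDocs);
        // List.sort is stable, so documents with equal scores keep their index order.
        ranked.sort(Comparator.comparingDouble(
                (String doc) -> mainScores.getOrDefault(doc, 0.0) + matchAllBoost * matchAllScore)
                .reversed());
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> mainScores = new HashMap<>();
        mainScores.put("doc3", 1.2);
        mainScores.put("doc2", 0.4);
        List<String> all = Arrays.asList("doc1", "doc2", "doc3", "doc4");
        System.out.println(rankAll(all, mainScores)); // [doc3, doc2, doc1, doc4]
    }
}
```

The match-all clause guarantees that every document is returned, while the zero boost keeps it from changing the relative order of the real hits.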