I have a webpage form which carries out a search of all the photos that users have uploaded to the website. The problem is that the Lucene search is currently retrieving all photos that meet the search criteria even though we are only displaying 21 photos on the page. This is causing serious performance issues. Is it possible to limit the number of photos retrieved to 21, in order to improve performance?
In the same way that we can restrict searches to a specific category, e.g. (Category: New), is there a similar way to restrict the number of hits?
This is what I do:
The search method takes the maximum number of results as a parameter, and I pass pageSize*page.
So for page 1, I get only pageSize documents.
Then I load the documents (using searcher.doc()) only for the page that I need.
TopDocs hits = searcher.search(lucene_query, pageSize * page);
ScoreDoc[] scoreDocs = hits.scoreDocs;
int j = startIndex; // startIndex/endIndex delimit the current page within scoreDocs
while (j < scoreDocs.length && (endIndex == 0 || j < endIndex)) {
    ScoreDoc sd = scoreDocs[j];
    Document d = searcher.doc(sd.doc); // load only the documents on the current page
    j++;
}
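If deep pages ever become a problem (page n still has to collect pageSize*n hits), Lucene's searchAfter can resume collection from the last ScoreDoc of the previous page instead. A minimal sketch, assuming the same searcher, lucene_query and pageSize as above:

// Sketch: fetch the next page without re-collecting the previous one.
TopDocs firstPage = searcher.search(lucene_query, pageSize);
if (firstPage.scoreDocs.length > 0) {
    ScoreDoc lastOfFirstPage = firstPage.scoreDocs[firstPage.scoreDocs.length - 1];
    // searchAfter continues collecting hits that rank after the given ScoreDoc
    TopDocs secondPage = searcher.searchAfter(lastOfFirstPage, lucene_query, pageSize);
    for (ScoreDoc sd : secondPage.scoreDocs) {
        Document d = searcher.doc(sd.doc); // render this photo
    }
}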
With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.
Do you know how this could be done, without deleting and rebuilding the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
- term attributes (the token text)
- offset attributes (start/end character positions)
- and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

String? fieldName = null;   // StandardAnalyzer does not vary by field, so no field name is needed here
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";  // lower-cased, because StandardAnalyzer lower-cases tokens

var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();

try
{
    ts.Reset();
    Console.WriteLine();
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    ts.Dispose();
}
The above example assumes that one of the hits from step 1 has a field containing "Foo Bar Baz Bar Bat", and that the search term is bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
I have a SQL table with people’s names and I want to find out which name is in a document that I’ve indexed with Lucene.
Is there a way to find this out other than searching for each name individually?
I think you can achieve that by using a WildcardQuery and searching for just the "*" string.
TopDocs hits = searcher.search(new WildcardQuery(new Term(AppConstants.NAME, "*")), 20);
if (null == hits.scoreDocs || hits.scoreDocs.length <= 0) {
    System.out.println("No Hits Found");
    return;
}
for (ScoreDoc hit : hits.scoreDocs) {
    Document doc = searcher.doc(hit.doc);
    // build a list of names, cross-check against the DB table, etc.
}
In the above code, AppConstants.NAME is the name of the field, and it is assumed that the reader and searcher have already been initialized.
In place of the hit limit of 20, you can specify any number you want, depending on the number of rows in the table.
Be prepared for the wildcard search to be slow; how slow depends heavily on your index size (see the MatchAllDocsQuery sketch below for a cheaper alternative).
When asking a question about Lucene, always specify the Lucene version and the API technology (Java or .NET).
The above Java code works for me with Lucene 6.0.0.
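If the wildcard turns out to be too slow, a MatchAllDocsQuery is usually cheaper, because it does not have to expand the "*" pattern against the term dictionary. A minimal sketch under the same assumptions (AppConstants.NAME, an already initialized searcher); note that it matches every document, not just those with a NAME term:

// Sketch: enumerate documents with MatchAllDocsQuery instead of a "*" wildcard.
TopDocs hits = searcher.search(new MatchAllDocsQuery(), 20);
for (ScoreDoc hit : hits.scoreDocs) {
    Document doc = searcher.doc(hit.doc);
    String name = doc.get(AppConstants.NAME); // null if the document has no stored NAME value
    // cross-check 'name' against the SQL table
}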
In Lucene, to get the words around a term, it is advised to use span queries. There is a good walkthrough at http://lucidworks.com/blog/accessing-words-around-a-positional-match-in-lucene/
The spans are supposed to be accessed using the getSpans() method.
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
Spans spans = fleeceQ.getSpans(searcher.getIndexReader());
Then in Lucene 4 the API changed and the getSpans() method got more complex, and finally, in the latest Lucene release (5.3.0), this method was removed (apparently moved to the SpanWeight class).
So, which is the current way of accessing spans matched by a span term query?
The way to do it would be as follows.
// 'reader' is your IndexReader and 'is' your IndexSearcher
LeafReader pseudoAtomicReader = SlowCompositeReaderWrapper.wrap(reader);
Term term = new Term("field", "fox");
SpanTermQuery spanTermQuery = new SpanTermQuery(term);
SpanWeight spanWeight = spanTermQuery.createWeight(is, false); // false: scores are not needed
Spans spans = spanWeight.getSpans(pseudoAtomicReader.getContext(), SpanWeight.Postings.POSITIONS);
The support for iterating over the spans via Spans.next() is also gone in version 5.3 of Lucene. To iterate over the spans you can do:
int nxtDoc;
while ((nxtDoc = spans.nextDoc()) != Spans.NO_MORE_DOCS) {
    System.out.println(spans.toString());
    int id = nxtDoc;
    System.out.println("doc_id=" + id);
    Document doc = reader.document(id);
    System.out.println(doc.getField("field"));
    // positions of the match within the current document
    System.out.println(spans.nextStartPosition());
    System.out.println(spans.endPosition());
}
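If you would rather avoid SlowCompositeReaderWrapper, the same weight can be asked for spans segment by segment. A rough sketch, assuming the spanWeight and reader from above (getSpans may return null for a segment with no matches):

// Sketch: iterate spans per leaf instead of wrapping the whole reader.
for (LeafReaderContext ctx : reader.leaves()) {
    Spans spans = spanWeight.getSpans(ctx, SpanWeight.Postings.POSITIONS);
    if (spans == null) continue; // the term does not occur in this segment
    int doc;
    while ((doc = spans.nextDoc()) != Spans.NO_MORE_DOCS) {
        int globalId = ctx.docBase + doc; // per-leaf doc IDs need the segment's docBase
        System.out.println("doc_id=" + globalId
                + ", start=" + spans.nextStartPosition()
                + ", end=" + spans.endPosition());
    }
}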
So, I've figured out how to get more than 100 tweets, thanks to How to retrieve more than 100 results using Twitter4j.
However, how do I make the script stop, and print that it has stopped, once all the available results have been retrieved? For example, I set
int numberOfTweets = 512;
And, it finds just 82 tweets matching my query.
However, because of:
while (tweets.size () < numberOfTweets)
it still keeps querying over and over until I max out my rate limit of 180 requests per 15 minutes.
I'm really a novice at Java, so I would really appreciate it if you could show me how to resolve this by modifying the first answer script at How to retrieve more than 100 results using Twitter4j
Thanks in advance!
You only need to modify things in the try{} block. One solution is to check whether the ID of the last tweet found on the previous loop iteration (previousLastID) is the same as the ID of the last tweet (lastID) in the newly collected batch (newTweets). If it is, the new batch's elements already exist in the previous array, and we have reached the end of the available tweets for this hashtag.
try {
    QueryResult result = twitter.search(query);
    List<Status> newTweets = result.getTweets();

    // Track the lowest tweet ID seen so far.
    long previousLastID = lastID;
    for (Status t : newTweets) {
        if (t.getId() < lastID) lastID = t.getId();
    }

    // If the lowest ID did not change, the batch contained nothing new: stop gathering.
    if (previousLastID == lastID) {
        println("Last batch (" + tweets.size() + " tweets) was the same as the previous one. Stopping the gathering process.");
        break;
    }
I am using Lucene to show search results in a web application, and I am implementing custom paging to display them.
Search results can vary from 5,000 to 10,000 hits or more.
Can someone please tell me the best strategy for paging and caching the search results?
I would recommend you don't cache the results, at least not at the application level. Running Lucene on a box with lots of memory that the operating system can use for its file cache will help though.
Just repeat the search with a different offset for each page. Caching introduces statefulness that, in the end, undermines performance. We have hundreds of concurrent users searching an index of over 40 million documents. Searches complete in much less than one second without using explicit caching.
Using the Hits object returned from search, you can access the documents for a page like this:
Hits hits = searcher.search(query);
int offset = page * recordsPerPage;
int count = Math.min(hits.length() - offset, recordsPerPage);
for (int i = 0; i < count; ++i) {
Document doc = hits.doc(offset + i);
...
}
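Note that the Hits class was removed in Lucene 3.0. On newer versions the same page slice can be taken from a TopDocs result; a minimal sketch, assuming the same query, page and recordsPerPage as above:

// Sketch: paging with TopDocs on Lucene versions where Hits no longer exists.
int offset = page * recordsPerPage;
TopDocs topDocs = searcher.search(query, offset + recordsPerPage); // collect only up to the end of this page
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
int count = Math.min(scoreDocs.length - offset, recordsPerPage);
for (int i = 0; i < count; ++i) {
    Document doc = searcher.doc(scoreDocs[offset + i].doc);
    // render this result
}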