Paging Lucene's search results

I am using Lucene to show search results in a web application. I am also implementing custom paging to display them.
Search results could vary from 5000 to 10000 or more.
Can someone please tell me the best strategy for paging and caching the search results?

I would recommend you don't cache the results, at least not at the application level. Running Lucene on a box with lots of memory that the operating system can use for its file cache will help though.
Just repeat the search with a different offset for each page. Caching introduces statefulness that, in the end, undermines performance. We have hundreds of concurrent users searching an index of over 40 million documents. Searches complete in much less than one second without using explicit caching.
Using the Hits object returned from search, you can access the documents for a page like this:
// page is zero-based; recordsPerPage is the page size
Hits hits = searcher.search(query);
int offset = page * recordsPerPage;
int count = Math.min(hits.length() - offset, recordsPerPage);
for (int i = 0; i < count; ++i) {
    Document doc = hits.doc(offset + i);
    // ... render this document as one row of the page
}
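The Hits class was deprecated and later removed in newer Lucene versions; the same offset-based approach can be expressed with TopDocs. A minimal sketch, assuming an IndexSearcher named searcher, zero-based page numbers, and a hypothetical getPage helper:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: returns the stored documents for one zero-based page.
List<Document> getPage(IndexSearcher searcher, Query query,
                       int page, int recordsPerPage) throws IOException {
    int offset = page * recordsPerPage;
    // Only ask Lucene for as many hits as this page needs.
    TopDocs topDocs = searcher.search(query, offset + recordsPerPage);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    List<Document> docs = new ArrayList<>();
    for (int i = offset; i < scoreDocs.length; ++i) {
        docs.add(searcher.doc(scoreDocs[i].doc));
    }
    return docs;
}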

Related

Very slow read from a database

I use Spring Boot with Spring Data JPA, Hibernate and Oracle.
In my table I have around 10 million records. For each one I need to do some processing, write info to a file and then delete the record.
The selection is a basic SQL query:
select * from zzz where status = 2;
I ran a test without doing the processing or deleting the records:
long start = System.nanoTime();
int page = 0;
Pageable pageable = PageRequest.of(page, LIMIT);
Page<Billing> pageBilling = billingRepository.findAllByStatus(pageable);
while (true) {
    for (Billing billing : pageBilling.getContent()) {
        // process
        // write to file
        // delete element
    }
    if (!pageBilling.hasNext()) {
        break;
    }
    pageable = pageBilling.nextPageable();
    pageBilling = billingRepository.findAllByStatus(pageable);
}
long end = System.nanoTime();
long microseconds = (end - start) / 1000;
System.out.println(microseconds + " to write");
The results are bad: with a limit of 10,000 the run took 157 minutes, with 100,000 it took 28 minutes, and with millions it took 19 minutes.
Is there a better solution to increase performance?
The following are likely to improve the performance significantly:
You should not iterate past the first page. Instead, delete the processed data and select the first page again. You don't actually need a Pageable for that; you can encode the limit in the method name (e.g. a findFirst1000By... derived query). Selecting late pages is rather inefficient.
The process of loading, processing and deleting one batch of items should run in a separate transaction. Otherwise the EntityManager will hold on to every entity it has ever loaded, which will make things really slow. A sketch combining both points follows.
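A minimal sketch of that pattern, assuming a hypothetical BillingRepository method List<Billing> findFirst1000ByStatus(int status) (the derived-query way of encoding the limit in the method name) and a status value of 2 as in the original query; each call runs in its own transaction:
import java.util.List;

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical service; the names and the batch size of 1000 are assumptions, not from the original code.
@Service
public class BillingBatchService {

    private final BillingRepository billingRepository;

    public BillingBatchService(BillingRepository billingRepository) {
        this.billingRepository = billingRepository;
    }

    /** Loads, processes and deletes one batch in its own transaction; returns false when nothing is left. */
    @Transactional
    public boolean processOneBatch() {
        List<Billing> batch = billingRepository.findFirst1000ByStatus(2);
        if (batch.isEmpty()) {
            return false;
        }
        for (Billing billing : batch) {
            // process and write to file
        }
        billingRepository.deleteAll(batch);
        return true;
    }
}
The caller then simply loops while (billingBatchService.processOneBatch()) { }. Because each iteration re-selects the first "page" and commits, the persistence context stays small and late-page selects are avoided.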
If that still isn't sufficient, you may look into the following:
Inspect the SQL that gets executed. Does it look sensible? If not, consider switching to JdbcTemplate or NamedParameterJdbcTemplate: with a query method that takes a RowCallbackHandler you should be able to load and process all rows with a single select statement, and at the end issue one delete statement to remove all rows. This requires that the status you use for filtering does not change in the meantime.
What do the execution plans look like? If they seem off, inspect your indices.

Using deepstream List for tens of thousands of unique values

I wonder if it's a good or bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to answer quickly whether we already have, say, a user with a given email (email in use), or to find another record by a specific unique field.
I made a few experiments today and ran into two problems:
1) When I tried to populate the list with a few thousand values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went down. I was able to fix it by giving the server's node process more memory with this flag:
--max-old-space-size=5120
It doesn't look great, but it allowed me to create a list with more than 5000 items.
2) That wasn't enough for my tests, so I pre-created a list of 50000 items by putting the data directly into the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors appears in production once the list reaches a certain size. I don't really know whether it's a Node.js, JavaScript, deepstream or RethinkDB issue. All of this made me think that I'm using deepstream List the wrong way. Please let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names: the actual data would be stored in the records themselves, and the list would only manage the order of the records.
Having said that, there are two open GitHub issues to improve performance for very long lists by sending more efficient deltas and by introducing a pagination option.
Interesting results in regard to memory though, definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
    myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
    entries.push( 'something-' + i );
}
myList.setEntries( entries );

Should paging be zero indexed within an API?

When implementing a REST API with parameters for paging, should paging be zero-indexed or start at 1? The parameters would be Page and PageSize.
For me, it makes sense to start at 1, since we are talking about pages.
There's no standard for it. Just have a look around: there are hundreds of thousands of APIs using different approaches.
Most of the APIs I know use one of the following approaches for pagination:
offset and limit or
page and size
Both can be 0 or 1 indexed. Which is better? That's up to you.
Just pick the one that fits your needs and document it properly.
Additionally, you could provide some links in the response payload to make the navigation easier between the pages.
Consider, for example, you are reading data from page 2. So, provide a link for the previous page (page 1) and for the next page (page 3):
{
    "data": [
        ...
    ],
    "paging": {
        "previous": "http://api.example.com/foo?page=1&size=10",
        "next": "http://api.example.com/foo?page=3&size=10"
    }
}
And remember, always make an API you would love to use.
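How the server decides which links to emit is simple bookkeeping. A minimal sketch, assuming 1-indexed pages (to match the URLs above) and that the total number of items is known; all names here are illustrative only:
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; baseUrl, page, size and totalItems are illustration-only names.
Map<String, String> buildPagingLinks(String baseUrl, int page, int size, long totalItems) {
    Map<String, String> paging = new LinkedHashMap<>();
    if (page > 1) {
        paging.put("previous", baseUrl + "?page=" + (page - 1) + "&size=" + size);
    }
    long lastPage = (totalItems + size - 1) / size; // ceiling division
    if (page < lastPage) {
        paging.put("next", baseUrl + "?page=" + (page + 1) + "&size=" + size);
    }
    return paging;
}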
True, there's no standard for this.
I find that older Microsoft-based products (like DAO for Visual Basic 6, Visual C++ 6 and similar) tended to start their pagination from 1, but a lot of other tech stacks use 0. Gradually I find that more and more libraries are using 0 instead of 1.
Why is this? Because, mathematically speaking, it's easier to map a pageIndex starting from 0 to a rowNumber in the DB or in an array. Suppose you have a dataset fetched from a table in the DB with 100 records. Now you want to send the second page (pageSize = 10, for example). With pageIndex starting from 0, you only need to write:
int startRowNumber = pageIndex * pageSize;
return dataSet.subList(startRowNumber, startRowNumber + pageSize);
Because in most DBs and languages, arrays/lists are 0-indexed. Even if your REST API language uses 1-indexed arrays, you would still have a problem when mapping a 1-indexed pageIndex to record IDs. For example: suppose you have a dataset indexed 1..100 (not 0..99), and you want to send the 11th to 20th records as the second page (here pageSize = 10 and pageIndex = 2, because in this case you start at 1). This means you need to use the formula
((pageIndex - 1) * pageSize) + 1 ; // to get the number 11.
You can see that 0-indexed paging is easier for developers.
1-indexed pagination makes more sense to human users, because we start with 1 when counting everything.
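To make the difference concrete, a small sketch of both conversions (the variable names are illustrative only):
// 0-indexed pages: the page maps straight onto 0-indexed rows.
int startRow0 = pageIndex * pageSize;             // pageIndex = 1, pageSize = 10 -> rows 10..19

// 1-indexed pages mapped onto a 1-indexed dataset.
int startRow1 = (pageIndex - 1) * pageSize + 1;   // pageIndex = 2, pageSize = 10 -> rows 11..20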

Read all documents under a particular category name using Alfresco search.luceneSearch or search.lib.js

Category Name
|-- Geograpy (8)
|-- Study Db (18)
I am implementing my own advanced search in Alfresco. I need to read all files related to a particular category.
Example:
If there are 20 files under Geograpy, the Lucene query should return the particular documents that match the search keyword "banana".
Further explanation -
I am using search.lib.js to search. I would like to analyze the result to find out which category the documents belong to. For example, I would like to know how many documents belong to the category Languages and its subcategories. I experimented with the Classification API but I don't get the result I want. Any idea how to go through the result and get the category name of each document?
Is there any simple method like node.properties["cm:creator"]?
Thanks,
Janaka
I think you should make your question more specific:
Are you using cm:content or a customized content type?
Are you going to search for the keyword inside the content of the file, or in specific metadata?
Do you want to create a webscript (Java or JavaScript)?
One thing to take into consideration:
If you use +PATH:"cm:generalclassifiable/...." for the categorization in your Lucene queries, performance will be slow (in my experience).
You can use, for example, the following query to find all nodes at any depth below /cm:Languages:
var results = search.luceneSearch("+PATH:\"/cm:generalclassifiable/cm:Languages//*\"");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop over them and get the category each one belongs to. Of course you need to keep a counter for each category/subcategory:
var counts = {};
for (var i = 0; i < results.length; i++) {
    var node = results[i];
    // cm:categories is multi-valued; this takes the first category reference on the node.
    var categoryNodeRef = node.properties["cm:categories"][0];
    var categoryName = categoryNodeRef.properties["cm:name"];
    var categoryDesc = categoryNodeRef.properties["cm:description"];
    counts[categoryName] = (counts[categoryName] || 0) + 1;
}
This is not exactly the solution, but it can be a useful starting point.
Sorry if it's not what you're asking for; I have just come back from my holidays.

Lucene: How to restrict the number of hits?

I have a webpage form which carries out a search of all the photos that users have uploaded to the website. The problem is that the Lucene search is currently retrieving all photos that meet the search criteria even though we are only displaying 21 photos on the page. This is causing serious performance issues. Is it possible to limit the number of photos retrieved to 21, in order to improve performance?
In the same way that we can restrict searches to a specific category by using e.g. (Category: New), is there a similar way to restrict the number of hits?
This is what I do:
The search method takes the number of results as a parameter. I pass pageSize * page.
So for page 1, I get only pageSize documents.
Then I only load the documents (using searcher.doc()) for the page that I need.
// page is 1-based; startIndex and endIndex mark the slice of hits that belongs to that page,
// e.g. startIndex = (page - 1) * pageSize and endIndex = page * pageSize (0 meaning no upper bound).
TopDocs hits = searcher.search(lucene_query, pageSize * page);
ScoreDoc[] scoreDocs = hits.scoreDocs;
int j = startIndex;
while (j < scoreDocs.length && (endIndex == 0 || j < endIndex)) {
    ScoreDoc sd = scoreDocs[j];
    Document d = searcher.doc(sd.doc);
    // ... use the document
    j++;
}
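As a side note, fetching pageSize * page hits gets more expensive the deeper you page. If your Lucene version has it, IndexSearcher.searchAfter continues the search after the last hit of the previous page instead. A minimal sketch; how lastHitOfPreviousPage is remembered between requests is left out, and that name is illustrative:
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// lastHitOfPreviousPage is null for the first page, otherwise the last ScoreDoc of the previous page.
TopDocs topDocs = (lastHitOfPreviousPage == null)
        ? searcher.search(lucene_query, pageSize)
        : searcher.searchAfter(lastHitOfPreviousPage, lucene_query, pageSize);

for (ScoreDoc sd : topDocs.scoreDocs) {
    Document d = searcher.doc(sd.doc);
    // ... use the document
}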