Accessing the paging data from a native script in ElasticSearch - scripting

I am currently using a native script written in Java to filter search results based on various forms of access control. The problem is that the access control verification takes a ridiculous amount of time per record. There are some ways we can improve it somewhat, but we came up with a work-around that would improve it drastically. The only problem is that I'm not sure I can do it the way I want.
The solution: I need to stop evaluating the access controls after the relevant number of results has been found.
The problem: I can't figure out how to access the offset and page size from within the script (currently implementing AbstractSearchScript) in order to decide when I've reached my minimum number of results. Does anyone have an idea how to get this data "properly", without making it a separate parameter of the script?
The bonus: I need to return a hit count that is close to, or larger than, the actual number of hits. Since Elasticsearch doesn't cache the results of the query, I can work around the problem by simply returning true for every result past the relevant ones. But I'd like a solution closer to Google's, where I return an estimate of the remaining results based on what percentage of the data has been a hit so far. To do this (and to avoid potential complications) I'd like to just modify the hits data directly. Is there any way to do that from a script?

You should probably use the "terminate_after" parameter introduced in ES 1.4, rather than trying to implement this yourself.
For your "bonus," I would just run the query a second time without the ACLs.

How to build an efficient query string in ASP.NET Core?

Hello World,
I'm researching a feature to be built into our software, and there is one thing we have never faced before.
On one form we have a drop-down with a list of items. The user can select the default, which means all items need to be considered, or they can selectively opt for certain items.
The form drives filter functionality: depending on the user's input, the data gets filtered and displayed in the UI.
The main problem we are trying to solve: suppose the user selects the default, which means all the list items' IDs are going to be included in the API's POST call. The list can be huge, say 1,000 items and above.
Under such circumstances we can build the query string, but it is going to be very large. I have also read that browsers only support query strings up to certain standard limits.
So currently I have the following doubts in mind:
Will shortening the query string work here?
Which technique can handle this efficiently?
What performance considerations do I need to take care of while doing so?
Any suggestions or thoughts are welcome; they would boost my software design thinking.
Based on what I understand from your question, here is my opinion (if I understood wrong, please correct me so I can help you):
You need to send the query in the URL and not in the body or as JSON, is that correct?
I don't think you need to send every one of the selected items one by one.
If they are selected in sequence, you can encode a range in your query.
For example http://abcd/test?id=1-43,6-765 (take id as a string and then extract the useful data in the back-end). With this approach you can shorten your query, as sketched below.
Also think about the database (if there is any): querying this much data uses a lot of I/O and makes the query slow.
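To make the range idea concrete, here is a rough sketch of expanding a value like "1-43,50,60-65" back into individual IDs on the server. The question is about ASP.NET Core and the logic translates directly to C#; it is shown in Java here, and all names are illustrative:

import java.util.ArrayList;
import java.util.List;

public class IdRangeParser {
    // Expands a compressed id parameter such as "1-43,50,60-65" into the full list of IDs.
    static List<Integer> parse(String idParam) {
        List<Integer> ids = new ArrayList<>();
        for (String part : idParam.split(",")) {
            if (part.contains("-")) {
                String[] bounds = part.split("-", 2);
                int from = Integer.parseInt(bounds[0].trim());
                int to = Integer.parseInt(bounds[1].trim());
                for (int id = from; id <= to; id++) {
                    ids.add(id);
                }
            } else {
                ids.add(Integer.parseInt(part.trim()));
            }
        }
        return ids;
    }
}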

PDFBOX - Unknown number of pages

I am investigating a replacement for iText and have been looking at the API and example code for PDFBox. I am slightly confused by its usage, though: it seems I need to manually create the page objects, which implies I need to know the number of pages beforehand, or at least work out when it's time to create a new page.
I generally use PDF generation for reports based on user configurable parameters which call stored procedures which can return varying amounts of data.
My question is quite simple, is it down to me to try and work out how much data will fit onto a page and create the pages programmatically?
The API seems to state that each page object represents a single page. From my experience of iText I did not need to worry about this: I simply wrote my data to the document and the pages were created for me based on the content I placed into it.
I recently made the switch from iText to PDFBox and ran into a similar issue. I asked this question and eventually worked out what I needed to do to generate reports with an unknown number of pages.
This model works well for generating reports containing lines of data generated from a ResultSet...though that's the only way I've been using it thus far. I may run into limitations, but for now, it's getting the job done.
And I should state that I am still laying out each page manually, but this method does at least generate my pages dynamically depending on the number of results returned.
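For reference, the core of that model looks roughly like this in PDFBox 2.x. The margins, font, and row source are placeholders; the key point is closing the content stream and adding a new PDPage whenever the vertical cursor passes the bottom margin:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import java.io.IOException;
import java.util.List;

public class ReportWriter {
    private static final float MARGIN = 50;
    private static final float LINE_HEIGHT = 14;

    public static void write(List<String> rows, String path) throws IOException {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            PDPageContentStream cs = new PDPageContentStream(doc, page);
            float y = page.getMediaBox().getHeight() - MARGIN;

            for (String row : rows) {
                if (y < MARGIN) {                      // ran out of room: start a new page
                    cs.close();
                    page = new PDPage();
                    doc.addPage(page);
                    cs = new PDPageContentStream(doc, page);
                    y = page.getMediaBox().getHeight() - MARGIN;
                }
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 10);
                cs.newLineAtOffset(MARGIN, y);
                cs.showText(row);
                cs.endText();
                y -= LINE_HEIGHT;
            }
            cs.close();
            doc.save(path);
        }
    }
}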

setFetchBatchSize doesn't seem to work properly

I asked a few questions on this topic but still can't get it to work. I have a Core Data store with 10k+ rows of people's names that I am showing in a table view. I would like to be able to search and update the table with every letter typed, but it's very laggy. As suggested, I watched the WWDC '10 Core Data presentation and tried to implement
[request setFetchBatchSize:50];
It doesn't seem to work. When I use Instruments to check Core Data, it still shows 10k+ fetches when loading the table view, and when I search it also fetches all the results.
Is there anything else that needs to be done to set the batch size, or is that not something that will help me?
The only thing that seems to work is setting the fetch limit to 100 when I search. Do you think that's a good solution?
Thanks in advance!
The batch size just tells it how many objects to fetch at a time. This is probably not going to help you very much. Let's consider your use case a bit...
The user types "F" and you tell the database, "Go find all the names that start with 'F'" and the database looks at all 10k+ records to find the ones that start with 'F'
Then, the user types 'r', so you tell the database to go find all the records that start with "Fr" and it again looks at all 10k+ records to find the ones that start with "Fr."
All fetchBatchSize is doing is telling it "Hey, when you fault in a record, bring in 50 at once because I'm going to probably need all those anyway." That does nothing to limit your search.
However, setting fetchLimit to 100 helps some because the database starts hunting through all 10k+ records, but once it has its 100 records, it does not have to keep looking at the rest of the records because it already has filled its request. It's done, and stops searching as soon as it gets 100 records that satisfy the request.
So, there are several things you can do, all depending on your other use cases.
The easiest thing to try is adding an index on the field that you are searching. You can set that in the Xcode model editor (section that says Indexes, right under where you can name the entity in the inspector). This will allow the database to setup a special index on that field, and searching will be much faster.
Second, after your initial request, you already have an array of names that begin with 'F', so there is no need to go back to the database to ask for names that begin with 'Fr'. If it begins with 'Fr' it also begins with 'F', and you already have NSManagedObject pointers for all of those. Now, you can just search the array you got back.
Even better, if you gave it a sort descriptor, the array is sorted. Thus, you can do a simple binary search on the array. Or, if you prefer, you can just use the same predicate, and apply it to the results array instead of the database.
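The question is about Core Data and Objective-C, but the narrowing trick itself is language-agnostic. Here is a small illustrative sketch, written in Java with made-up names, of refining the previously fetched, sorted result set in memory instead of re-querying; since the list is sorted you could also binary-search to the first and last matching entries instead of scanning:

import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class SearchNarrowing {
    // 'previousMatches' is the sorted list already fetched for the shorter prefix,
    // e.g. everything starting with "F". Narrow it in memory for "Fr".
    static List<String> narrow(List<String> previousMatches, String newPrefix) {
        String p = newPrefix.toLowerCase(Locale.ROOT);
        return previousMatches.stream()
                .filter(name -> name.toLowerCase(Locale.ROOT).startsWith(p))
                .collect(Collectors.toList());
    }
}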
Even if you don't use the results-pruning I just discussed, I think indexing the attribute will help dramatically.
EDIT
Maybe you should run Instruments to see how much time you are spending where. Also, a badly formed predicate can bring any index scheme to its knees. Code would help.
Finally, consider how many elements you are bringing into memory. CoreData does not fault all the information in, but it does create shells for everything in the array.
If you give it a sort descriptor, the results also come back already sorted.
I don't know how SQLite implements its search on an index, but a B-tree lookup has complexity log_B(N), so even on 30k records that's not a lot of searching. Unless you have another problem, the indexing should give you a pretty big improvement.
Once you have the index, you should not be examining all records. However, you still may have a very large set of data. Try fetchBatchSize on those, because it will limit the number of records fetched, and create proxies for the rest.
You can also call countForFetchRequest: instead of executeFetchRequest: to get the number of items. Then, you can use fetchLimit to restrict the number you get.
As far as working all this with a fetched results controller... well, that guy has to know the records, so it still has to do the search.
Also, one place to look... are you doing sections? If you have a user defined comparator for anything (like translating for sections) this will get called for every single record.
Thus, I guess my big suggestion, after making the index change, is to run instruments and really study it to determine where you are spending your time. It should be pretty obvious. That will help steer you toward the real issue.
My bet is that you are still accessing all of the elements for some reason...

Staggering sqlite records with Objective C

An iPhone app we are producing will potentially have a table with 100,000+ records as it scales, and I am concerned that loading all this data via a SELECT * command into an array or an object will either make the app very slow or leave me with memory issues later on.
Obviously, loading 100,000+ records into an array when the viewport only shows 30-odd records at a time is a tad silly.
In addition, I am not sure that storing all this data in an object is the right thing to do either.
So my question is: is it possible to stagger SQLite through the records, say keeping 50 records in a cache, and then loading an appropriate amount into the cache as you swipe up or down? I guess it's similar to the jQuery lazy-loading approach, where only a little is loaded into the viewport and more is loaded as you move down.
I'm looking at JSON, but it appears to be only for web services, as it requires a URL, and I am not sure it works for files that are on the phone.
So: is there a proper way to load SQLite data into Objective-C arrays/objects without causing problems when the data starts to scale?
Thanks for your help.
You definitely want to avoid loading in everything at once. Instead, you want to use a cursor and smarter queries so that you only pull data out of the database as you actually need it for display (see the sketch below).
Generally, you also want to avoid blindly selecting all columns. This is because you may at some point update the schema to add a column that your existing code would not know how to extract; it is far better to explicitly name the columns that you expect and want.
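As an illustration of the paging idea, here is a minimal sketch using LIMIT/OFFSET. It is shown with the SQLite JDBC driver purely to keep it self-contained; on the phone you would issue the same SQL through the sqlite3 C API or a wrapper, and the table and column names here are made up:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class PageLoader {
    private static final int PAGE_SIZE = 50;   // hypothetical cache window
    private final Connection conn;

    public PageLoader(String dbPath) throws SQLException {
        this.conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
    }

    // Load one "page" of rows; call again with a larger offset as the user scrolls.
    public List<String> loadPage(int offset) throws SQLException {
        List<String> names = new ArrayList<>();
        String sql = "SELECT id, name FROM people ORDER BY name LIMIT ? OFFSET ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, PAGE_SIZE);
            ps.setInt(2, offset);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}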

Can Parallel.ForEach be used safely with CloudTableQuery

I have a reasonable number of records in an Azure Table that I'm attempting to do some one-time data encryption on. I thought that I could speed things up by using a Parallel.ForEach. Also, because there are more than 1K records and I don't want to mess around with continuation tokens myself, I'm using a CloudTableQuery to get my enumerator.
My problem is that some of my records have been double-encrypted, and I realised that I'm not sure how thread-safe the enumerator returned by CloudTableQuery.Execute() is. Has anyone else out there had any experience with this combination?
I would bet it is highly unlikely that Execute() returns a thread-safe IEnumerator implementation. That said, this sounds like yet another case for the producer-consumer pattern.
In your specific scenario I would have the original thread that called Execute() read the results off sequentially and stuff them into a BlockingCollection<T>. Before you start doing that, though, you want to start a separate Task that will control the consumption of those items using Parallel.ForEach. You will probably also want to look into using the GetConsumingPartitioner method of the ParallelExtensions library in order to be most efficient, since the default partitioner will create more overhead than you want in this case. You can read more about this in this blog post.
An added bonus of using BlockingCollection<T> over a raw ConcurrentQueue<T> is that it offers the ability to set bounds, which can help block the producer from adding more items to the collection than the consumers can keep up with. You will of course need to do some performance testing to find the sweet spot for your application.
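The question is about .NET, but to make the shape of that pattern concrete, here is a rough, language-agnostic sketch in Java: a bounded BlockingQueue standing in for a bounded BlockingCollection<T>, a single producer walking the (assumed non-thread-safe) enumeration, and a small pool of consumers doing the encryption. All names are illustrative:

import java.util.List;
import java.util.concurrent.*;

public class ProducerConsumerSketch {
    // A bounded queue keeps the single producer from racing far ahead of the consumers.
    private static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(500);
    private static final String POISON = "__done__";   // sentinel to shut consumers down
    private static final int CONSUMERS = 4;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(CONSUMERS);
        for (int i = 0; i < CONSUMERS; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String record = queue.take();
                        if (POISON.equals(record)) break;
                        encrypt(record);                // the parallel work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single producer: only this thread walks the non-thread-safe enumeration.
        for (String record : fetchRecordsSequentially()) {
            queue.put(record);                          // blocks if consumers fall behind
        }
        for (int i = 0; i < CONSUMERS; i++) {
            queue.put(POISON);
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static List<String> fetchRecordsSequentially() {
        return List.of("a", "b", "c");                  // placeholder for the table scan
    }

    private static void encrypt(String record) { /* one-time encryption goes here */ }
}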
Despite my best efforts I've been unable to replicate my original problem. My conclusion is therefore that it is perfectly OK to use Parallel.ForEach loops with CloudTableQuery.Execute().