How to deal with pagination on GET requests? - api

I'm creating a small script for automation, but I ran into a problem.
Suppose I use a GET API method to fetch results, then I want to add one to each result.
The assumption itself is not difficult:
def numbers = get(endpoint)
numbers.each { int number ->
    log.info(number + 1)
}
I am having a hard time, however, figuring out the correct approach to pagination. The response limit for a query is 100, and before submitting a query I don't know how many results to expect (there might be more than 100, in which case I have to use pagination).
In this case, should I first determine how many results there are, and only then create for loops for each "page"?
Or should I try a while loop, and keep sending GET requests until the number of returned results is < 100?
Something like:
boolean more = true  // 'continue' is a reserved word, and Groovy's type is boolean
int startAt = 0
while (more) {
    def numbers = get(endpoint)
        .queryString('startAt', startAt)
    numbers.each { int number ->
        log.info(number + 1)
    }
    startAt += 100
    // a page with fewer than 100 results means we've reached the end
    if (numbers.total < 100) more = false
}
So far I have been using a for loop, but with two different endpoints. One endpoint showed me the maximum number of results, the second the details for each result. But the second one was limited to 100 results, so I counted how many loops I needed by dividing the total number of results by 100.
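For what it's worth, the while-loop idea can be sketched independently of the HTTP library. In the Python sketch below, get_page(start_at, limit) is a hypothetical stand-in for the GET call; the loop stops as soon as a page comes back smaller than the limit:

```python
def fetch_all(get_page, limit=100):
    """Page through a capped endpoint until a short page signals the end.

    get_page(start_at, limit) is a hypothetical stand-in for the GET
    call in the question; it returns a list of at most `limit` results.
    """
    start_at = 0
    results = []
    while True:
        page = get_page(start_at, limit)
        results.extend(page)
        if len(page) < limit:
            # a short (or empty) page means there is nothing left;
            # when the total is an exact multiple of limit, this costs
            # one extra empty request at the end
            break
        start_at += limit
    return results
```

This avoids the up-front count query entirely: the stop condition is "the server returned fewer rows than I asked for", which works whether there are 3 results or 30,000.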


How to implement a time based length queue in F#?

This is a followup to question: How to optimize this moving average calculation, in F#
To summarize the original question: I need to make a moving average of a set of data I collect; each data point has a timestamp and I need to process data up to a certain timestamp.
This means that I have a list of variable size to average.
The original question has the implementation as a queue where elements gets added and eventually removed as they get too old.
But, in the end, iterating through a queue to make the average is slow.
Originally the bulk of the CPU time was spent finding the data to average, but then once this problem was removed by only keeping the data needed in the first place, the Seq.average call proved to be very slow.
It looks like the original mechanism (based on Queue<>) is not appropriate and this question is about finding a new one.
I can think of two solutions:
implement this as a circular buffer which is large enough to accommodate the worst-case scenario; this would allow using an array and only two iterations to compute the sum.
quantize the data in buckets and pre-sum it, but I'm not sure if the extra complexity will help performance.
Is there any implementation of a circular buffer that would behave similarly to a Queue<>?
The fastest code, so far, is:
module PriceMovingAverage =

    // moving average queues
    let private timeQueue = Queue<DateTime>()
    let private priceQueue = Queue<float>()

    // update the moving average
    let updateMovingAverage (tradeData: TradeData) priceBasePeriod =
        // add the new price
        timeQueue.Enqueue(tradeData.Timestamp)
        priceQueue.Enqueue(float tradeData.Price)
        // remove the items older than the price base period
        let removeOlderThan = tradeData.Timestamp - priceBasePeriod
        let rec dequeueLoop () =
            // guard against Peek on an emptied queue
            if timeQueue.Count > 0 && timeQueue.Peek() < removeOlderThan then
                timeQueue.Dequeue() |> ignore
                priceQueue.Dequeue() |> ignore
                dequeueLoop()
        dequeueLoop()

    // get the moving average
    let getPrice () =
        try
            Some (
                priceQueue
                |> Seq.average // <- all the CPU time goes here
                |> decimal
            )
        with _ ->
            None
Based on a queue length of 10-15k I'd say there's definitely scope to consider batching trades into precomputed blocks of maybe around 100 trades.
Add a few types:
type TradeBlock = {
    data: TradeData array
    startTime: DateTime
    endTime: DateTime
    sum: float
    count: int
}

type AvgTradeData =
    | Trade of TradeData
    | Block of TradeBlock
I'd then make the moving average use a DList<AvgTradeData> (https://fsprojects.github.io/FSharpx.Collections/reference/fsharpx-collections-dlist-1.html). The first element in the DList is summed manually if its startTime is after the price period cutoff, and it is removed from the list once the price period exceeds its endTime. The last elements in the list are kept as Trade tradeData until 100 have been appended, and then they are all removed from the tail and turned into a TradeBlock.
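A simpler alternative to block-summing (my own sketch, not part of the answer above) is to maintain a running sum next to the queue, so reading the average never re-scans it. Illustrated in Python with plain numeric timestamps; DateTime arithmetic maps onto this directly:

```python
from collections import deque

class MovingAverage:
    """Time-windowed moving average with an O(1) running sum.

    Instead of re-averaging the whole queue on every read (the
    Seq.average hotspot in the question), the sum is adjusted as
    items are enqueued and expired.
    """
    def __init__(self, window):
        self.window = window
        self.items = deque()   # (timestamp, price) pairs, oldest first
        self.total = 0.0

    def update(self, timestamp, price):
        self.items.append((timestamp, price))
        self.total += price
        # expire items older than the window, adjusting the sum as we go
        cutoff = timestamp - self.window
        while self.items and self.items[0][0] < cutoff:
            _, old_price = self.items.popleft()
            self.total -= old_price

    def average(self):
        # O(1): no iteration over the queue
        return self.total / len(self.items) if self.items else None
```

The trade-off is a small amount of floating-point drift in the running sum over very long runs, which the block-summing design avoids by re-summing each block once.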

Improve time of count function

I am new to Kotlin (and Java). In order to pick up on the language I am trying to solve some problems from a website.
The problem is quite easy and straightforward: the function has to count how many times the biggest value occurs in an IntArray. My function works for smaller arrays, but it seems to exceed the allowed time limit for larger ones (error: Your code did not execute within the time limits).
fun problem(inputArray: Array<Int>): Int {
    // Write your code here
    val n: Int = inputArray.count { it == inputArray.max() }
    return n
}
So as I am trying to improve I am not looking for a faster solution, but for some hints on topics I could look at in order to find a faster solution myself.
Thanks a lot!
In an unordered array you have to touch every element to calculate inputArray.max(). So inputArray.count() goes over all elements and, for each one, calls max(), which itself goes over all elements.
So the runtime grows as n^2 for n elements.
Store inputArray.max() in an extra variable, and you have linear runtime.
val max = inputArray.max()
val n: Int = inputArray.count{ it == max }
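The same fix, sketched in Python for comparison; the point is only that the maximum is computed once, outside the counting pass:

```python
def count_of_max(xs):
    """Count occurrences of the maximum value in two linear passes."""
    m = max(xs)                          # O(n), computed exactly once
    return sum(1 for x in xs if x == m)  # second O(n) pass
```

Two O(n) passes instead of n passes of O(n): the difference between passing and failing the time limit on large inputs.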

Getting Term Frequencies For Query

In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
From looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null)
{
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}

IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);

Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);

float freq = 0;
if (thisDoc == deBasedDoc)
{
    freq = scorer.freq();
}
The 'weightCache' is of type Map and is useful so that you don't have to re-create the Weight object for every document you process. (otherwise, the code runs about 10x slower)
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.
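The clause-walking itself is just a tree recursion. Abstracting away the Lucene types, a Python sketch of the pattern (the tuple encoding below is purely a made-up stand-in for TermQuery and BooleanQuery.clauses(); it is not a Lucene API):

```python
def leaf_terms(query):
    """Recursively collect leaf term queries from a nested boolean query.

    `query` is assumed to be either ("term", text) or
    ("bool", [subqueries]) -- a hypothetical stand-in for Lucene's
    TermQuery and BooleanQuery.clauses() structure.
    """
    kind, payload = query
    if kind == "term":
        return [payload]
    # boolean node: flatten the leaf terms of every sub-query, in order
    return [t for sub in payload for t in leaf_terms(sub)]
```

In real Lucene code the brittleness comes from having to handle each Query subclass (BooleanQuery, DisjunctionMaxQuery, BoostQuery wrapping, etc.) in the recursion; there is no single generic "children" accessor covering them all.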

Optimal Solution: Get a random sample of items from a data set

So I recently had this as an interview question and I was wondering what the optimal solution would be. The code is in Objective-C.
Say we have a very large data set, and we want to get a random sample
of items from it for testing a new tool. Rather than worry about the
specifics of accessing things, let's assume the system provides these
things:
// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);

// Interface to implementations other people write.
@interface Dataset : NSObject

// YES when there is no more data.
- (BOOL)endOfData;

// Get the next element and move forward.
- (NSString*)getNext;

@end

// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
    // Describe how this works.
}
Edit: So you are supposed to randomly select items from a given array. So if k = 5, then I would want to randomly select 5 elements from the dataset and return an array of those items. Each element in the dataset has to have an equal chance of getting selected.
This seems like a good time to use Reservoir Sampling. The following is an Objective-C adaptation for this use case:
NSMutableArray* result = [[NSMutableArray alloc] initWithCapacity:k];
int i, j;
// fill the reservoir with the first k elements (assumes at least k exist)
for (i = 0; i < k; i++) {
    [result setObject:[input getNext] atIndexedSubscript:i];
}
for (i = k; ![input endOfData]; i++) {
    j = Rand(i + 1);  // element i must survive with probability k/(i+1)
    NSString* next = [input getNext];
    if (j < k) {
        [result setObject:next atIndexedSubscript:j];
    }
}
return result;
The code above is not the most efficient reservoir sampling algorithm, because it generates a random number for every stream entry past index k. Slightly more complex algorithms exist under the general category "reservoir sampling". This is an interesting read on an algorithm named "Algorithm Z". I would be curious if people find newer literature on reservoir sampling, too, because this article was published in 1985.
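For reference, the textbook Algorithm R that the Objective-C above adapts, sketched in Python; rand(n) models the question's Rand(n), returning an int in [0, n):

```python
import random

def reservoir_sample(stream, k, rand=random.randrange):
    """Uniform sample of k items from a stream of unknown length.

    Classic reservoir sampling (Algorithm R): keep the first k items,
    then let item i (0-based) displace a random reservoir slot with
    probability k/(i+1).
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rand(i + 1)  # note i + 1, so P(j < k) == k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

If the stream has fewer than k items, the whole stream is returned, which is the only sensible answer in that edge case.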
Interesting question, but as there is no count or similar method on Dataset and you are not allowed to iterate more than once, I can only come up with this solution to get good random samples (no handling of k > data size):
- (NSArray *)getSamples:(unsigned)k from:(Dataset *)input {
    NSMutableArray *source = [[NSMutableArray alloc] init];
    while (![input endOfData]) {
        [source addObject:[input getNext]];
    }
    NSMutableArray *ret = [[NSMutableArray alloc] initWithCapacity:k];
    int count = [source count];
    while ([ret count] < k) {
        int index = Rand(count);
        [ret addObject:[source objectAtIndex:index]];
        [source removeObjectAtIndex:index];
        count--;
    }
    return ret;
}
This is not the answer I did in the interview but here is what I wish I had done:
Store pointer to first element in dataset
Loop over dataset to get count
Reset dataset to point at first element
Create an NSMutableDictionary for storing random indexes
Run a for loop k times. Each iteration, generate a random value and check whether it exists in the dictionary. If it does, keep generating random values until you get a fresh one.
Loop over the dataset. If the current index is in the dictionary, add the element to the array of random subset values.
Return array of random subsets.
There are multiple ways to do this, the first way:
1. use the input parameter k to dynamically allocate an array of numbers
unsigned *numsArray = (unsigned *)malloc(sizeof(unsigned) * k);
2. run a loop that gets k random numbers and stores them in numsArray (being careful to check each new random number against the ones already drawn, and redrawing on a duplicate)
3. sort numsArray
4. run a loop over the DataSet with your own incrementing counter dataCount and another counter numsCount, both beginning at 0. Whenever dataCount equals numsArray[numsCount], grab the current data object, add it to your newly created random list, then increment numsCount.
5. the loop in step 4 can end either when numsCount reaches k or when dataCount reaches the end of the dataset.
6. the only other step needed, before any of this, is to iterate the dataset once with getNext to count how large it is, so you can bound your random numbers and check that k is less than or equal to that count.
The 2nd way to do this would be to run through the actual list MULTIPLE times.
// one must assume that once we get to the end, we can start over within the set again
1. run a while loop that checks for endOfData
a. count up a count variable that is initialized to 0
2. run a loop from 0 through k-1
a. generate a random number constrained to the list size
b. run a loop that moves through the dataset until it hits the random element
c. compare that element with all other elements in your new list to make sure it isn't already in your new list
d. store the element into your new list
There may be ways to speed up the 2nd method by storing a current list location; that way, if you generate a random index past the current pointer, you don't have to move through the whole list again to get back to element 0 and then on to the element you wish to retrieve.
A potential 3rd way to do this might be to:
1. run a loop from 0 through k-1
a. generate a random
b. use the generated random as a skip count, move skip count objects through the list
c. store the current item from the list into your new list
The problem with this 3rd method is that without knowing how big the list is, you don't know how to constrain the random skip count. Further, even if you did, chances are it wouldn't truly be a randomly grabbed subset: it would become statistically unlikely that you would ever reach the last element in the list (i.e. not every element is given an equal chance of being selected).
Arguably the FASTEST way to do this is method 1, where you generate the random numbers first, then traverse the list only once (well, actually twice: once to get the size of the dataset list, then again to grab the random elements).
We need a little probability theory. Like others, I will ignore the case n < k. The probability that the n'th item is selected into the set of size k is just C(n-1, k-1) / C(n, k), where C is the binomial coefficient. A bit of math shows that this is just k/n. For the rest, note that the selection of the n'th item is independent of all other selections. In other words, "the past doesn't matter."
So an algorithm is:
S = set of up to k elements
n = 0
while not end of input
    v = next value
    n = n + 1
    if |S| < k, add v to S
    else if random(0,1) < k/n, replace a randomly chosen element of S with v
I will let the coders code this one! It's pretty trivial. All you need is an array of size k and one pass over the data.
If you care about efficiency (as your tags suggest) and the number of items in the population is known, don't use reservoir sampling. That would require you to loop through the entire data set and generate a random number for each item.
Instead, pick k random indexes (five, in your example) in the range 0 to n-1. In the unlikely case there is a duplicate among the indexes, replace the duplicate with another random value. Then use the indexes to do random-access lookups of the selected elements in the population.
This is simple, it uses a minimum number of calls to the random number generator, and it accesses memory only for the relevant selections.
If you don't know the number of data elements in advance, you can loop over the data once to get the population size and proceed as above.
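A sketch of that index-picking approach in Python (redrawing on duplicates, and assuming k is at most the population size):

```python
import random

def sample_by_index(data, k, rand=random.randrange):
    """Sample k distinct elements by random-access index lookups.

    rand(n) models the question's Rand(n): an int in [0, n).
    Assumes k <= len(data); duplicate indexes are simply redrawn.
    """
    chosen = set()
    while len(chosen) < k:
        chosen.add(rand(len(data)))   # a duplicate leaves the set unchanged
    return [data[i] for i in sorted(chosen)]
```

For k much smaller than n, duplicates are rare, so this makes close to k calls to the generator and touches only k elements of the population.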
If you aren't allowed to iterate over the data more than once, use a chunked form of reservoir sampling: 1) Choose the first five elements as the initial sample, each having a probability of 1/5th. 2) Read in a large chunk of data and choose five new samples from that set (using only five calls to Rand). 3) Pairwise, decide whether to keep the new sample item or the old sample element (with odds proportional to the probabilities for each of the two sample groups). 4) Repeat until all the data has been read.
For example, assume there are 1000 data elements (but we don't know this in advance).
Choose the first five as the initial sample: current_sample = read(5); population=5.
Read a chunk of n datapoints (perhaps n=200 in this example):
subpop = read(200);
m = len(subpop);
new_sample = choose(5, subpop);
loop-over the two samples pairwise:
for (a, b) in (current_sample and new_sample): if random(0 to population + m) < population, then keep a, otherwise keep b
population += m
repeat

Groovy Sql Paging Behavior

I am using the groovy.sql.Sql class to query a database and process the results. My problem is that the ResultSet can be very large; so large that I risk running out of memory if I try to process the whole ResultSet at once. I know the Sql.rows() method supports paging using offset and max results parameters but I haven't been able to find a good example of how to use it (and I'm not certain that paging is what I'm looking for).
Basically, here's what I'm trying to do:
def endOfResultSet = false
for (int x = 1; !endOfResultSet; x += 1000) {
    def result = sql.rows("Select * from table", x, 1000)
    processResult(result)
    endOfResultSet = result.size() != 1000
}
My question is if Groovy is smart enough to reuse the same result set for the sql.rows("Select * from table", x, 1000) call or if it will be repeatedly be running the same statement on the database and then paging to where the offset starts.
Your help is appreciated, Thanks!
Edit: What I'm trying to avoid is running the same query on the database multiple times. I'd like to run the query once, get the first 1,000 rows, process them, get the next 1,000 rows, etc... until all the rows are processed.
I assume you've seen this blog post about paging?
To answer your question, if we look at the code for the Sql class in Groovy, we can see that the code for rows(String,int,int) calls rows(String,int,int,null)
And the code for that is:
AbstractQueryCommand command = createQueryCommand(sql);
ResultSet rs = null;
try {
    rs = command.execute();
    List<GroovyRowResult> result = asList(sql, rs, offset, maxRows, metaClosure);
    rs = null;
    return result;
} finally {
    command.closeResources(rs);
}
So as you can see, it gets the full ResultSet, then steps through this inside the asList method, filling a List<GroovyRowResult> object with just the results you requested.
Edit (after the question was edited)
As I said in my comment below, I think you're going to need to write your own paging query for the specific database you are using... For example, with MySQL, your above query can be changed to:
def result = sql.rows( "SELECT * FROM table LIMIT ${Sql.expand x}, 1000" )
Other databases will have different methods for this sort of thing...I don't believe there is a standard implementation
The answer above is not correct. If you dig deeper, you'll find that if the ResultSet is not TYPE_FORWARD_ONLY, then the ResultSet's "absolute" method is invoked to position a server-side cursor, and then maxRows rows are returned. If the ResultSet is TYPE_FORWARD_ONLY, then ResultSet.next() is invoked offset times, and then maxRows rows are returned. The exact performance characteristics will depend on the underlying JDBC driver implementation, but usually you want a scrollable result set when using the paging feature.
The resultset is not reused between invocations. Sounds like you want something like streaming, not paging.
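To illustrate the streaming idea outside Groovy: in the Python/sqlite3 sketch below (the table and query are made up for the example), the statement executes exactly once and rows are drained in batches from the same open cursor, instead of re-running the query per page:

```python
import sqlite3

def stream_rows(conn, query, batch=1000):
    """Execute the query once, then yield rows in batches from one cursor.

    This is the streaming pattern: a single execution with a live
    cursor, rather than repeated offset-based page queries.
    """
    cur = conn.execute(query)
    while True:
        rows = cur.fetchmany(batch)   # pulls the next batch only
        if not rows:
            break
        for row in rows:
            yield row
```

Because rows are consumed lazily, memory use is bounded by the batch size rather than the full result set, which is exactly what the asker is after.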
Also, I wrote the patch, btw.
http://jira.codehaus.org/browse/GROOVY-4622