how to get next 1000 records the fastest way - azure-storage

I'm using Azure Table Storage.
Let's say i have a Partition in my Table with 10,000 records, and I would like to get records number 1000 to 1999. And next time i would like to get records number 4000 to 4999 etc.
What is the fastest way of doing that?
All I can find till now are two options, which I don't like very much:
1. run a query which returns all 10,000 records, and filter out what I want when I get all 10,000 records.
2. Run a query whichs returns 1000 records at a time, and use a continuation token to get the next 1000 records.
Is it possible to get a continuation token without downloading all corresponding records? It would be great if i can get Continuation Token 1, than get Continuation token 2, and with CT2 get records 2000 to 2999.

Theoretically you should be able to use continuation tokens without downloading the actual data for the first 1000 recors by closing the connection you have after the first request. And I mean closing it at TCP level. And before you read all data. Then open a new connection and use continuation token there. Two WebRequests will not do it since the HTTP implementation will likely use keep alive wchich means all your data is going to be read in the background even though you don't read it in your code. Actually you can configure your HTTP requests to not use keep alive.
However, another way is naturally if you know the RowKey and can search on that but I assume you don't know which row keys will be in each 1000 entity batch.
Last I would ask why you have this problem in the first place. And what your access pattern is. If inserts are common and getting these records is rare I wouldn't bother making it more efficient. if this is like a paging problem i would probably get all data on the first request and cache it (in the cloud). if inserts are rare but you need to run this query often I would consider making the insertion of data have one partion for every 1000 entities and rebalance as needed (due to sorting) as entities are inserted.

Related

Listing BigQuery Tables in `huge/big` Datasets - 30K-40K+ tables

The task is to programmatically list all the tables within the given dataset with more than 30-40K tables
Initial option we explored was using tables.list API (as we do all the times for normal datasets with reasonable number of tables in them)
Looks like this API returns max 1000 entries (even if we try to set maxResults to bigger value)
To take next 1000 we need to “wait” for response of previous request then extract pageToken and repeat call and so on
For the datasets with 30K – 40K+ this can take up to 10-15 and more sec (under good weather)
So the timing is a problem for us that we want to address!
In above mentioned calls we are getting back only nextPageToken and tables/tableReference/tableId so size of response is extremely small!
Question:
Is there way to somehow increase maxResults, so to get all tables in one (or very few) call(s) (assuming it will be much faster than doing 30-40 calls)?
The workaround we tried so far is to use __TABLES_SUMMARY__ with jobs.insert or jobs.query API.
This way – the whole result is returned within the seconds – but in our particular case – using BigQuery jobs API is not an option for multiple reasons. We want to be able to use list API

BigQuery-Java: difference between QueryResponse and GetQueryResultsResponse

In sample code provided by Google, 2 classes are used to fetch results. QueryResponse and GetQueryResultsResponse.
I am not able to understand purpose of these 2 classes and do we have to use these 2 classes?
We are getting data from both: queryResponse.getRows() and queryResults.getRows()
I have gone through docs but could not figure out. what is the difference between these 2 classes and which is better to use?
Those two results are virtually identical (in fact, they are identical in the raw HTTP request). The difference is how you get them.
QueryResponse is returned by jobs.query(). This method can be used to run a query, but has only limited configuration options. It is intended as a convenience function. For more query options (such as setting a destination table, allowing large results, etc), use jobs.insert(). Another limitation of jobs.query() is that it may time out before the query has completed. Partly, this is because many clients (such as in AppEngine) require all HTTP requests to finish within 30 seconds or so. If jobs.query() times out, it will still report a job id that can be used to fetch the results with jobs.get_query_results().
GetQueryResultsResponse is returned by jobs.get_query_results(). This can be used to get the results of a query started by either jobs.query() or jobs.insert(). Query results (if you don't specify a destination table) are available for 24 hours after the query completes. jobs.get_query_results() allows you to fetch these results at any time. jobs.query() only gives you the query results once.
There is a further difference between the two, which is that jobs.query() just returns the first page of results. jobs.get_query_results() can be used to get multiple pages of results.
Hopefully this clarifies things a bit.

BigQuery paging issues with tableData.list()

We're trying to load 120,000 rows from a BigQuery table using tableData.list. Our first call,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000
returns a pageToken as expected and the first 22482 (1 to 22482) rows. We assume this is due to the 10mb serialized JSON limit. Our second call however,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000&pageToken=CIDBB777777QOGQIBCIL6BIQSC7QK===
returns not the next rows, but 22137 rows starting at row 900001 to the 112137, without a pageToken.
Strangely, If we change maxResults to 100,000, we get rows starting from 100,000.
We are able to work around this using startRowIndex to page. In this case, we start with the first call being startRowIndex =0 and we never get a pageToken in the response. We keep making calls until all rows are retrieved. However, the concern is without pageTokens, if the row order changes while the calls are being made, the data could be invalid.
We are able to reproduce this behavior on multiple tables with different sizes and structures.
Is there any issue with paging, or should we be structuring our calls differently?
This is a known, high priority bug. That has been fixed.

WCF stateful service

Or - at least I think the correct term is stateful. I got a wcf service, listing lots of data back to me. So much data in fact, that I'm exceeding maxrecievedmessagesize - and the program crashes.
I've come to realize that I need to split the calls to the db. Instead of retrieving 5000 rows, I need to get row 1 - 200, remember the id of row number 200, get the next 200 rows from the id of row number 200 and so on.
Does anyone know how to do this? Is stateful (as in 'opposite of stateless') the correct way to go? And how would I proceed...? Could someone point me to an example?
You do not need stateful service in your scenario. Stateful services is better to avoid, especially when you want to save 5000 rows there.
Client should specified how much data it needs. So it could be method GetRows(index, amount), where index is start index for getting rows and amount of rows to get beginning from starting index.
Also client should ask about data state from service, and service just sends data state. For example when you have these 5000 rows, you could have method on service GetRowsState(index, amount) and the same story it's just saying you the last updated time for your rows, when time you have received is higher or other then on client, then once more to GetRows from server to update client data state.

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve massive amount of data. (This is actually a layer on top of JDBC.)
The API I created tries to limit as much as possible the loading of every queried information into memory. I mean that I prefer to iterate over the result set and process the returned row one by one instead of loading every rows in memory and process them later.
But I am wondering if this is the best practice since it has some issues:
The result set is kept during the whole processing, if the processing is as long as retrieving the data, it means that my result set will be open twice as long
Doing another query inside my processing loop means opening another result set while I am already using one, it may not be a good idea to start opening too much result sets simultaneously.
On the other side, it has some advantages:
I never have more than one row of data in memory for a result set, since my queries tend to return around 100k rows, it may be worth it.
Since my framework is heavily based on functionnal programming concepts, I never rely on multiple rows being in memory at the same time.
Starting the processing on the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate them or export them using many different formats (to the ERP, to the web platform, etc.)
There is no universal answer. I personally implemented both solutions dozens of times.
This depends of what matters more for you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can set prefetch count or your database layer properties and find a golden mean.
Rule of thumb is: fetch everything that you can keep without noticing it
if you need more detailed analysis, there are six factors involved:
Row generation responce time / rate(how soon Oracle generates first row / last row)
Row delivery response time / rate (how soon can you get first row / last row)
Row processing response time / rate (how soon can you show first row / last row)
One of them will be the bottleneck.
As a rule, rate and responce time are antagonists.
With prefetching, you can control the row delivery response time and row delivery rate: higher prefetch count will increase rate but decrease response time, lower prefetch count will do the opposite.
Choose which one is more important to you.
You can also do the following: create separate threads for fetching and processing.
Select just ehough rows to keep user amused in low prefetch mode (with high response time), then switch into high prefetch mode.
It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.