Listing BigQuery tables in huge/big datasets (30K-40K+ tables)

The task is to programmatically list all the tables in a given dataset that contains more than 30-40K tables.
The first option we explored was the tables.list API (as we always do for normal datasets with a reasonable number of tables).
It turns out this API returns at most 1,000 entries per call, even if we set maxResults to a bigger value.
To get the next 1,000 we have to wait for the previous response, extract its pageToken, repeat the call, and so on.
For datasets with 30-40K+ tables this can take 10-15 seconds or more (on a good day).
This timing is the problem we want to address!
In the above-mentioned calls we get back only nextPageToken and tables/tableReference/tableId, so the response size is extremely small!
Question:
Is there a way to increase maxResults so we can get all tables in one (or very few) call(s), assuming that would be much faster than making 30-40 calls?
The workaround we have tried so far is to query __TABLES_SUMMARY__ via the jobs.insert or jobs.query API.
That way the whole result is returned within seconds, but in our particular case using the BigQuery jobs API is not an option for multiple reasons; we want to be able to use the list API.
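For reference, a minimal sketch of the paging loop described above, assuming the google-api-python-client discovery client and application-default credentials (both assumptions; project and dataset names are placeholders):

from googleapiclient.discovery import build

def list_all_table_ids(project_id, dataset_id):
    # Hedged sketch of the tables.list paging loop; assumes the discovery-based
    # Python client and default credentials, which may differ from the poster's setup.
    service = build("bigquery", "v2")
    table_ids = []
    page_token = None
    while True:
        resp = service.tables().list(
            projectId=project_id,
            datasetId=dataset_id,
            maxResults=1000,   # each page is capped by the API, as noted above
            pageToken=page_token,
            # partial response: only the fields mentioned in the question
            fields="nextPageToken,tables/tableReference/tableId",
        ).execute()
        for t in resp.get("tables", []):
            table_ids.append(t["tableReference"]["tableId"])
        page_token = resp.get("nextPageToken")
        if not page_token:     # last page reached
            return table_ids

Each iteration has to wait for the previous response to obtain its nextPageToken, which is why the 30-40 sequential round trips add up.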

Related

BigQuery-Java: difference between QueryResponse and GetQueryResultsResponse

In the sample code provided by Google, two classes are used to fetch results: QueryResponse and GetQueryResultsResponse.
I am not able to understand the purpose of these two classes. Do we have to use both of them?
We are getting data from both queryResponse.getRows() and queryResults.getRows().
I have gone through the docs but could not figure it out. What is the difference between these two classes, and which is better to use?
Those two results are virtually identical (in fact, they carry the same data in the raw HTTP response). The difference is how you get them.
QueryResponse is returned by jobs.query(). This method can be used to run a query, but has only limited configuration options. It is intended as a convenience function. For more query options (such as setting a destination table, allowing large results, etc), use jobs.insert(). Another limitation of jobs.query() is that it may time out before the query has completed. Partly, this is because many clients (such as in AppEngine) require all HTTP requests to finish within 30 seconds or so. If jobs.query() times out, it will still report a job id that can be used to fetch the results with jobs.get_query_results().
GetQueryResultsResponse is returned by jobs.get_query_results(). This can be used to get the results of a query started by either jobs.query() or jobs.insert(). Query results (if you don't specify a destination table) are available for 24 hours after the query completes. jobs.get_query_results() allows you to fetch these results at any time. jobs.query() only gives you the query results once.
There is a further difference between the two, which is that jobs.query() just returns the first page of results. jobs.get_query_results() can be used to get multiple pages of results.
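To make the flow concrete, here is a hedged sketch using the Python discovery client rather than the Java classes from the question (an assumption, for illustration only); the two responses correspond to QueryResponse and GetQueryResultsResponse:

from googleapiclient.discovery import build

def run_query(project_id, sql):
    # Hedged sketch: jobs.query() starts the query (QueryResponse analogue);
    # jobs.getQueryResults() fetches and pages the results (GetQueryResultsResponse analogue).
    service = build("bigquery", "v2")
    query_response = service.jobs().query(
        projectId=project_id,
        body={"query": sql, "timeoutMs": 10000},
    ).execute()
    job_id = query_response["jobReference"]["jobId"]

    rows = []
    page_token = None
    while True:
        results = service.jobs().getQueryResults(
            projectId=project_id,
            jobId=job_id,
            timeoutMs=10000,      # waits until the job completes or the timeout expires
            pageToken=page_token,
        ).execute()
        if not results.get("jobComplete", False):
            continue              # keep polling; the job id survives the timeout
        rows.extend(results.get("rows", []))
        page_token = results.get("pageToken")
        if not page_token:
            return rows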
Hopefully this clarifies things a bit.

Problems loading a series of snapshots by date

I have been running into a consistent problem using the LBAPI which I feel is probably a common use case given its purpose. I am generating a chart which uses LBAPI snapshots of a group of Portfolio Items to calculate the chart series. I know the minimum and maximum snapshot dates, and need to query once a day in between these two dates. There are two main ways I have found to accomplish this, both of which are not ideal:
Use the _ValidFrom and _ValidTo filter properties to limit the results to snapshots within the selected timeframe. This is bad because it will also load snapshots which I don't particularly care about. For instance if a PI is revised several times throughout the day, I'm really only concerned with the last valid snapshot of that day. Because some of the PIs I'm looking for have been revised several thousand times, this method requires pulling mostly data I'm not interested in, which results in unnecessarily long load times.
Use the __At filter property and send a separate request for each query date. This method is not ideal because some charts would require several hundred requests, with many requests returning redundant results. For example if a PI wasn't modified for several days, each request within that time frame would return a separate instance of the same snapshot.
My workaround for this was to simulate the effect of __At, but with several filters per request. To do this, I added this filter to my request:
Rally.data.lookback.QueryFilter.or(_.map(queryDates, function(queryDate) {
    return Rally.data.lookback.QueryFilter.and([{
        property : '_ValidFrom',
        operator : '<=',
        value    : queryDate
    }, {
        property : '_ValidTo',
        operator : '>=',
        value    : queryDate
    }]);
}))
But of course, a new problem arises... Adding this filter results in much too large a request to be sent via the LBAPI unless I query fewer than ~20 dates. Is there a way I can send larger filters to the LBAPI? Or will I need to break this up into several requests, which would only make this solution slightly better than the second option above?
Any help would be much appreciated. Thanks!
Conner, my recommendation is to download all of the snapshots, even the ones you don't want, and marshal them on the client side. There is functionality in the Lumenize library that's bundled with the App SDK that makes this relatively easy, and the TimeSeriesCalculator will also accomplish this for you, with even more features like aggregating the data into series.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: given a list of items, I need to figure out whether each item (which will be the 10-digit number) is present in the dataset.
The approach I have planned:
1. Parse the dataset and put it in a DB such as MySQL or Postgres (to be done at the start of the week). The reason I want an RDBMS in the first step is that I want full time-series data.
2. Generate some kind of key-value store from this database containing the latest valid data, one that supports checking whether each item is present in the dataset or not (thinking of some kind of NoSQL DB here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast. This functionality will be the back end of a REST API.
Side note: my language of preference is Python.
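A minimal sketch of step 2, assuming the flat file has one record per line with the phone number first and that the lookup store is a plain Redis set (file name, key name and batch size are illustrative):

import redis

def load_numbers(path="weekly_dataset.txt", key="phone_numbers"):
    # Hypothetical loader: builds a read-only Redis set of phone numbers from
    # the weekly flat file; the file layout and key name are assumptions.
    r = redis.Redis(host="localhost", port=6379)
    pipe = r.pipeline(transaction=False)
    with open(path) as f:
        for i, line in enumerate(f, 1):
            number = line.split()[0]      # the 10-digit phone number comes first
            pipe.sadd(key, number)
            if i % 10000 == 0:            # flush in batches to keep memory bounded
                pipe.execute()
    pipe.execute()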
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER command, which performs set intersection (see the sketch below).
You might benefit from using a grid structure by distributing number ranges over some hash function, such as the first digit of the phone number (there are probably better ones; you have to experiment). With an optimal hash this would, for example, reduce the size per node to around 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
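A hedged sketch of the SINTER idea: put the numbers to check into a temporary set and intersect it with the main set (key names are assumptions):

import redis

def check_numbers(numbers, key="phone_numbers"):
    # Returns the subset of `numbers` that is present in the dataset.
    r = redis.Redis(host="localhost", port=6379)
    tmp_key = "lookup:tmp"
    r.delete(tmp_key)
    r.sadd(tmp_key, *numbers)
    present = r.sinter(key, tmp_key)      # intersection = numbers found in the set
    r.delete(tmp_key)
    return {n.decode() for n in present}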

How to get the next 1000 records the fastest way

I'm using Azure Table Storage.
Let's say I have a partition in my table with 10,000 records, and I would like to get records 1000 to 1999. The next time I would like to get records 4000 to 4999, etc.
What is the fastest way of doing that?
All I can find till now are two options, which I don't like very much:
1. Run a query which returns all 10,000 records, and filter out what I want once I have them all.
2. Run a query which returns 1,000 records at a time, and use a continuation token to get the next 1,000 records.
Is it possible to get a continuation token without downloading all the corresponding records? It would be great if I could get continuation token 1, then continuation token 2, and with CT2 get records 2000 to 2999.
Theoretically you should be able to use continuation tokens without downloading the actual data for the first 1,000 records by closing the connection after the first request, and I mean closing it at the TCP level, before you have read all the data. Then open a new connection and use the continuation token there. Two WebRequests will not do it, since the HTTP implementation will likely use keep-alive, which means all your data is going to be read in the background even though you don't read it in your code. Actually, you can configure your HTTP requests not to use keep-alive.
However, another way is naturally if you know the RowKey and can search on that but I assume you don't know which row keys will be in each 1000 entity batch.
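If the RowKeys do encode an order you can exploit, a range filter lets you jump straight to the slice you want. A minimal sketch, assuming the newer azure-data-tables Python SDK (which postdates this question) and zero-padded numeric RowKeys, both of which are assumptions:

from azure.data.tables import TableClient

def get_slice(conn_str, table, partition_key, start, end):
    # Hedged sketch: fetch "records start..end" via a RowKey range filter,
    # assuming RowKeys are zero-padded record numbers such as "0000001000".
    client = TableClient.from_connection_string(conn_str, table_name=table)
    flt = "PartitionKey eq @pk and RowKey ge @lo and RowKey le @hi"
    return list(client.query_entities(
        flt,
        parameters={"pk": partition_key,
                    "lo": str(start).zfill(10),
                    "hi": str(end).zfill(10)},
    ))

# e.g. get_slice(conn_str, "MyTable", "P1", 1000, 1999) would return that slice directly.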
Lastly, I would ask why you have this problem in the first place, and what your access pattern is. If inserts are common and getting these records is rare, I wouldn't bother making it more efficient. If this is a paging problem, I would probably get all the data on the first request and cache it (in the cloud). If inserts are rare but you need to run this query often, I would consider making the insertion of data use one partition for every 1,000 entities and rebalance as needed (due to sorting) as entities are inserted.

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)
The API I created tries to avoid, as much as possible, loading all of the queried information into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one instead of loading every row into memory and processing them later.
But I am wondering if this is the best practice, since it has some issues:
The result set is kept open during the whole processing; if the processing takes as long as retrieving the data, my result set will be open twice as long.
Doing another query inside my processing loop means opening another result set while I am already using one; it may not be a good idea to open too many result sets simultaneously.
On the other hand, it has some advantages:
I never have more than one row of data in memory per result set; since my queries tend to return around 100k rows, it may be worth it.
Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
Starting to process the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate it or export it in many different formats (to the ERP, to the web platform, etc.)
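For illustration, a minimal sketch of the row-by-row style described above, using Python's python-oracledb driver as a stand-in for the JDBC-based layer (an assumption; names are illustrative):

import oracledb

def stream_rows(user, password, dsn, sql):
    # Hedged sketch: yield rows one by one instead of materialising the whole
    # result set; python-oracledb stands in for the JDBC layer in the question.
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            for row in cur:       # the driver fetches in batches behind the scenes
                yield row         # only the current batch is held in memory

# Processing can then be chained functionally without holding 100k rows at once, e.g.
#   valid = (r for r in stream_rows(u, p, dsn, "select * from items") if is_valid(r))

Note that the connection and result set stay open for as long as the generator is being consumed, which is exactly the trade-off discussed above.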
There is no universal answer; I have personally implemented both solutions dozens of times.
It depends on what matters more to you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can set the prefetch count via your database layer's properties and find a golden mean.
The rule of thumb is: fetch as much as you can hold without noticing it.
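As an illustration of finding that golden mean, a hedged sketch using python-oracledb's prefetchrows/arraysize settings as a stand-in for the JDBC fetch size (the driver choice and the values are assumptions, not recommendations):

import oracledb  # assumed driver; the question's stack is JDBC

def fetch_tuned(conn, sql, batch_rows=500):
    # Hedged sketch: control how many rows each network round trip brings back
    # while still handing rows to the caller one at a time.
    cur = conn.cursor()
    cur.prefetchrows = batch_rows + 1   # rows fetched together with the execute call
    cur.arraysize = batch_rows          # rows fetched on each subsequent round trip
    cur.execute(sql)
    for row in cur:                     # the driver refills its buffer transparently
        yield row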
If you need a more detailed analysis, there are six factors involved:
Row generation response time / rate (how soon Oracle generates the first row / the last row)
Row delivery response time / rate (how soon you can get the first row / the last row)
Row processing response time / rate (how soon you can show the first row / the last row)
One of them will be the bottleneck.
As a rule, rate and response time are antagonists.
With prefetching, you can control the row delivery response time and row delivery rate: a higher prefetch count will increase the rate but decrease the response time, and a lower prefetch count will do the opposite.
Choose which one is more important to you.
You can also do the following: create separate threads for fetching and processing.
Select just enough rows to keep the user amused in low-prefetch mode (with high response time), then switch into high-prefetch mode.
It will fetch the rows in the background, and you can process them in the background too, while the user browses through the first rows.
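A hedged sketch of that fetch/process split, with a background thread feeding a bounded queue (the cursor, process_row and the queue size are illustrative placeholders):

import queue
import threading

def fetch_and_process(cursor, process_row, max_buffered=1000):
    # Hedged sketch: one thread fetches rows into a bounded queue while the
    # caller processes them; cursor and process_row are assumed placeholders.
    buf = queue.Queue(maxsize=max_buffered)   # bounded, so fetching cannot run away
    done = object()                           # sentinel marking the end of the rows

    def fetcher():
        for row in cursor:                    # background fetching, row by row
            buf.put(row)
        buf.put(done)

    threading.Thread(target=fetcher, daemon=True).start()
    while True:
        row = buf.get()
        if row is done:
            break
        process_row(row)                      # foreground processing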