I am fetching paginated data from BigQuery. Since the data is huge, it takes a lot of time to process it.
while (results.hasNextPage()) {
    results = results.getNextPage();
    count += results.getValues().spliterator().getExactSizeIfKnown();
    results
        .getValues()
        .forEach(row -> {
            // Some operations.
        });
    logger.info("Grouping completed in iteration {}. Progress: {} / {}", i, count, results.getTotalRows());
    i++;
}
I profiled my program with VisualVM and realized that the majority of the time is spent on the results.getNextPage() line, i.e. fetching the next page of data. Is there any way to make this parallel? I mean fetching each batch of data (20K rows in my case) in a different thread. I am using the Java client com.google.cloud.bigquery.
Each query writes to a destination table. If no destination table is provided, the BigQuery API automatically populates the destination table property with a reference to a temporary anonymous table.
Having that table, you can use the tabledata.list API call to get the data from it. Among the optional parameters you will see a startIndex parameter that you can set to whatever you want and use in your pagination script.
You can run parallel API calls using different offsets, which will speed up your requests; see the sketch below.
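For illustration, here is a rough sketch of that parallel-offset idea using the Python google-cloud-bigquery client (the Java client you are using exposes the same offset via its table-data listing options); the query, page size and worker count are placeholders, not recommendations.

from concurrent.futures import ThreadPoolExecutor

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT ...")           # your query
query_job.result()                               # wait for the query to finish
table = client.get_table(query_job.destination)  # the (anonymous) destination table

PAGE_SIZE = 20_000
offsets = range(0, table.num_rows, PAGE_SIZE)

def process_page(offset):
    # tabledata.list with startIndex / maxResults under the hood
    rows = client.list_rows(table, start_index=offset, max_results=PAGE_SIZE)
    for row in rows:
        pass  # your per-row operations
    return offset

with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(process_page, offsets):
        print(f"finished page starting at {done}")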
You can refer to this document to page through results using the API.
I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row's data should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials


def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)
    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # result's total_rows is 1300000 records
    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but gets too slow when the result has 1300000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", since I am constructing a huge string by concatenating each record with a newline?
As I am running this program in AWS Batch, it is consuming too much time. I need help with iterating over the result and writing it to a file in a faster way for millions of records.
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal.
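As a rough illustration of reading with the Storage API (and writing the newline-delimited gzip output the question above asks for), here is a sketch with the Python google-cloud-bigquery-storage v1 client; the project, dataset and table names are placeholders, the fastavro extra is assumed to be installed, and for a query you would point it at the query's destination table.

import gzip
import json

from google.cloud import bigquery_storage

client = bigquery_storage.BigQueryReadClient()

table = "projects/your-project/datasets/your_dataset/tables/your_table"
requested_session = bigquery_storage.types.ReadSession(
    table=table,
    data_format=bigquery_storage.types.DataFormat.AVRO,
)
session = client.create_read_session(
    parent="projects/your-project",
    read_session=requested_session,
    max_stream_count=1,            # raise this to read several streams in parallel
)

reader = client.read_rows(session.streams[0].name)
with gzip.open("query_result.json.gz", "wt", encoding="utf-8") as out:
    for row in reader.rows(session):                        # dict-like Avro records
        out.write(json.dumps(dict(row), default=str) + "\n")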
You should (in most cases) point the output of your BigQuery query to a temporary table and export that temporary table to a Google Cloud Storage bucket. From that bucket you can download the files locally. This is the fastest route to having results available locally. Everything else will be painfully slow, especially iterating over the results, as BigQuery is not designed for that.
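A minimal sketch of that route with the Python client, assuming a placeholder bucket name; the wildcard URI lets BigQuery shard large exports, and NEWLINE_DELIMITED_JSON plus GZIP matches the format asked for above.

from google.cloud import bigquery

client = bigquery.Client()

query_job = client.query("SELECT ...")   # your query; results land in a temp destination table
query_job.result()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
extract_job = client.extract_table(
    query_job.destination,                        # the temp table holding the query results
    "gs://your-bucket/query_result-*.json.gz",    # wildcard so large results can be sharded
    job_config=job_config,
)
extract_job.result()                              # then download the files from the bucket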
Is there a way to update two documents, each in a different collection, in one request?
I know you can batch write using FIRWriteBatch - this seems to be limited to the same collection for any document updates. When trying to attach updates for documents in two different collections:
// Just for example
FIRWriteBatch *batch = FIRWriteBatch.new;
[batch updateData:@{@"posts" : @1} forDocument:[self.firebase.usersCollection documentWithPath:@"some_user_id"]];
[batch setData:@{@"test" : @"cool"} forDocument:[self.firebase.postsCollection documentWithPath:@"some_post_id"]];
[batch commitWithCompletion:^(NSError * _Nullable error) {
    NSLog(@"error: %@", error.localizedDescription);
}];
It never executes - the app crashes and I get this:
Terminating app due to uncaught exception 'FIRInvalidArgumentException', reason:
'Provided document reference is from a different Firestore instance.'
Apparently the batch does not like updates in more than one collection.
Does anyone know how you can update two documents, each in a different collection, without one failing and the other succeeding?
I want to avoid, for example, successfully setting posts = 1 for a document in usersCollection, while failing to write a new document in postsCollection.
I understand it's pretty unlikely that one will write while the other fails, but in the case that it does happen, I obviously don't want inconsistent data.
NOTE:
For anyone who cares - I don't know if it will ever fail, but as of now I am running the transaction without reading the document before updating data... 🤷‍♂️ Cheers to -1 API call!
You should use a transaction, which is documented adjacent to batch writes:
Using the Cloud Firestore client libraries, you can group multiple
operations into a single transaction. Transactions are useful when you
want to update a field's value based on its current value, or the
value of some other field. You could increment a counter by creating a
transaction that reads the current value of the counter, increments
it, and writes the new value to Cloud Firestore.
You are not limited to a single collection when performing a transaction. You are just obliged to read the document before you write it:
A transaction consists of any number of get() operations followed by
any number of write operations such as set(), update(), or delete().
In the case of a concurrent edit, Cloud Firestore runs the entire
transaction again. For example, if a transaction reads documents and
another client modifies any of those documents, Cloud Firestore
retries the transaction. This feature ensures that the transaction
runs on up-to-date and consistent data.
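The question is about the iOS SDK, but the read-then-write shape of a transaction is the same in every Firestore client. Purely as an illustration, here is a sketch with the server-side Python google-cloud-firestore client, reusing the collection and document IDs from the question.

from google.cloud import firestore

db = firestore.Client()

user_ref = db.collection("users").document("some_user_id")
post_ref = db.collection("posts").document("some_post_id")

@firestore.transactional
def update_user_and_post(transaction, user_ref, post_ref):
    snapshot = user_ref.get(transaction=transaction)      # all reads must come before writes
    current = (snapshot.to_dict() or {}).get("posts", 0)
    transaction.update(user_ref, {"posts": current + 1})  # both writes commit atomically,
    transaction.set(post_ref, {"test": "cool"})           # even though they hit two collections

update_user_and_post(db.transaction(), user_ref, post_ref)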
I have a BigQuery table with millions of records.
I am able to paginate using the GetQueryResultsResponse.getPageToken() method. getPageToken() returns null if the underlying BigQuery table is receiving new inserts. The pageToken works fine if there are no inserts happening.
How can I avoid this and traverse the table even while inserts are happening on the BigQuery table?
I am using google-api-services-bigquery v2-rev330-1.22.0
It's not entirely clear, but I think you are talking about tabledata.list pagination (a query result cannot have records streamed into it).
In such cases - instead of pageToken you can use startIndex (along with maxResults).
Knowing the number of items in each response (the real page size), you can always calculate the startIndex for the next page to request (without using pageToken).
With some extra bookkeeping around those start indexes in your app, you can manage paging in both directions (Next and Prev).
And of course you can always navigate to:
First Page (startIndex = 0)
and
Last Page (startIndex = totalRows - expected page size).
One more note: if the table is being streamed into (it has a streaming buffer at the time you call list), totalRows may not be available - in this case you can make an extra call to the Tables: get API and use numRows instead.
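The question uses the low-level Java library, but as an illustration of the startIndex/maxResults arithmetic, here is a sketch with the Python google-cloud-bigquery client, where the same options are exposed as start_index and max_results on list_rows; the table name and page size are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your_dataset.your_table")  # Tables: get

PAGE_SIZE = 1000
total_rows = table.num_rows     # numRows from Tables: get, the fallback mentioned above

def fetch_page(start_index):
    return list(client.list_rows(table, start_index=start_index, max_results=PAGE_SIZE))

first_page = fetch_page(0)                              # first page
next_page = fetch_page(len(first_page))                 # next start = previous start + rows received
last_page = fetch_page(max(total_rows - PAGE_SIZE, 0))  # last page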
How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?
Objective: how much has it grown over time
Nuances: of the 70-some fields, some records would have nulls for most of them, some would have long string text grabbed directly from the raw logs, and some could be float/integer/date types.
I'm wondering if there's an easy way to do a proxy count of the bit size for one day, which I could then expand to a range of time.
An example from my experience:
One of my tables is a daily sharded table with a daily size of 4-5 TB. The schema has around 780 fields. I wanted to understand the cost of each data point (bit size) [it was then used for calculating ROI based on cost/usage].
So, let me give you an idea of how the cost (bit-size) side of it was approached.
The main piece here is the use of the dryRun property of the Jobs: query API.
Setting dryRun to true makes BigQuery (instead of actually running the job) return statistics about the job, such as how many bytes would be processed. And that's exactly what is needed here!
So, for example, the request below is designed to get the cost of trafficSource.referralPath in the ga_sessions table for 2017-01-05:
POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
  "query": "SELECT trafficSource.referralPath FROM `yourProject.yourDataset.ga_sessions_20170105`",
  "dryRun": true,
  "useLegacySql": false
}
You can get this value by parsing totalBytesProcessed out of the response. See an example of such a response below:
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "yourBillingProject"
  },
  "totalBytesProcessed": "371385",
  "jobComplete": true,
  "cacheHit": false
}
So, you can write a relatively simple script in the client of your choice that:
reads the schema of your table - you can use the Tables: get API for this, or if the schema is known and readily available you can simply hardcode it
loops through each and every field in the schema
inside the loop - calls the query API and extracts the size of the respective field (as outlined above) and of course logs it (or just collects it in memory)
As a result of the above you will have a list of all fields with their respective sizes; see the sketch below.
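A rough sketch of such a script with the Python client; the table and project names are the placeholders from the request above, and nested RECORD fields would need recursion, which is omitted here.

from google.cloud import bigquery

client = bigquery.Client(project="yourBillingProject")
table_id = "yourProject.yourDataset.ga_sessions_20170105"
table = client.get_table(table_id)                   # Tables: get, to read the schema

sizes = {}
for field in table.schema:                           # loop over every top-level field
    job = client.query(
        f"SELECT {field.name} FROM `{table_id}`",
        job_config=bigquery.QueryJobConfig(dry_run=True, use_legacy_sql=False),
    )
    sizes[field.name] = job.total_bytes_processed    # totalBytesProcessed from the dry run

for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size} bytes")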
If you then need to analyze how those sizes change over time, you can wrap the above in yet another loop, iterating through as many days as you need and collecting stats for each and every day.
If you are not interested in day-by-day analysis, you can just make sure your query actually covers the range you are interested in. This can be done with the use of a wildcard table.
I consider this a relatively easy way to go.
Personally, I remember doing this in Go, but it doesn't matter - you can use any client that you are most comfortable with.
Hope this will help you!
I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1. Select a first block of data
SELECT * FROM TABLE LIMIT 1,10000
2. Process that block
while ($row = mysql_fetch_array($result)) {
    // do something
}
3. Get the next block
and so on.
Assuming 10000 is an acceptable block size for my system, let us suppose I have 30000 records to get: I perform 3 calls to the remote system.
But my question is: when executing a SELECT, is the result set transmitted and then stored somewhere locally, so that each fetch is local? Or is the result set kept on the remote system, with records coming one by one at each fetch? Because if the real scenario is the second, I don't perform 3 calls but 30000 calls, and that is not what I want.
I hope I explained myself; thanks for the help.
bye
First, it's highly recommended to use MySQLi or PDO instead of the deprecated mysql_* functions:
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default, with both the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved, if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value of MYSQLI_STORE_RESULT causes the entire result set to be transferred to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as requested.
PDO by default will load data as needed when using PDO::query() or PDO::prepare() to execute the query and retrieving results with PDO::fetch().
To retrieve all data from the result set into a PHP array, you can use PDO::fetchAll()
Prepared statements can also use the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY constant, though PDO::fetchAll() is recommended.
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
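The same buffered-versus-unbuffered trade-off exists in most MySQL client libraries, not only PHP's. Purely as an illustration of the concept (outside PHP), here is a sketch using Python's PyMySQL, where the default cursor buffers the whole result set and SSCursor streams rows as you fetch them; the connection details and table name are placeholders.

import pymysql
import pymysql.cursors

conn = pymysql.connect(host="remote-host", user="user", password="secret", database="db")

# Default cursor: the whole result set is transferred and buffered client-side.
with conn.cursor() as cur:
    cur.execute("SELECT * FROM your_table LIMIT 10000")
    for row in cur.fetchall():
        pass  # process row

# SSCursor (unbuffered): rows are streamed from the server as you iterate.
with conn.cursor(pymysql.cursors.SSCursor) as cur:
    cur.execute("SELECT * FROM your_table")
    for row in cur:
        pass  # process row

conn.close()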
You would be performing 3 calls, not 30000. That's for sure.
Each batch of 10000 results is produced on the server (one batch per query). Your while loop iterates through a set of data that has already been returned by MySQL (that's why you don't end up with 30000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
    // do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.
According to the documentation for mysql_fetch_array, all the data is fetched to the PHP side first, and then you iterate through it.
From the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this function is deprecated, so you might want to use one of the alternatives suggested there.