I'm new to Google Cloud and would like to know the best way to schedule queries and export their results to Google Cloud Storage.
I've seen documentation on how to export data manually, but couldn't find anything specific on doing it in an automated way.
Is there a recommended way to approach this?
Thanks
It is possible to create scheduled export jobs with the scheduled queries feature and the EXPORT DATA statement. For example, the script below backs up data daily to GCS as Parquet files with SNAPPY compression. Each time the job runs, it exports all of the previous day's data.
DECLARE backup_date DATE DEFAULT DATE_SUB(@run_date, INTERVAL 1 DAY);

EXPORT DATA
  OPTIONS (
    uri = CONCAT('gs://my-bucket/', CAST(backup_date AS STRING), '/*.parquet'),
    format = 'PARQUET',
    compression = 'SNAPPY',
    overwrite = FALSE
  ) AS
SELECT
  *
FROM
  `my-project.my-dataset.my-table`
WHERE
  DATE(timestamp) = backup_date;
From the BigQuery UI you can then create a scheduled query and set the trigger frequency and trigger time.
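If you prefer to set the schedule up programmatically rather than through the UI, the same script can be registered as a scheduled query with the BigQuery Data Transfer Service client. This is a minimal sketch assuming the google-cloud-bigquery-datatransfer Python package; the project id, display name, and schedule are placeholders:

from google.cloud import bigquery_datatransfer

# Sketch: create the scheduled query programmatically instead of via the UI.
# The project id, display name, schedule and table names are placeholders.
client = bigquery_datatransfer.DataTransferServiceClient()

export_script = """
DECLARE backup_date DATE DEFAULT DATE_SUB(@run_date, INTERVAL 1 DAY);
EXPORT DATA
  OPTIONS (
    uri = CONCAT('gs://my-bucket/', CAST(backup_date AS STRING), '/*.parquet'),
    format = 'PARQUET',
    compression = 'SNAPPY',
    overwrite = FALSE
  ) AS
SELECT * FROM `my-project.my-dataset.my-table`
WHERE DATE(timestamp) = backup_date;
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily-gcs-parquet-backup",
    data_source_id="scheduled_query",
    params={"query": export_script},
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)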
Implement your table export function [1] using Node.js, Python, or Go. These languages are supported by Cloud Functions and BigQuery.
Deploy that function to Cloud Functions [2], an event-driven serverless compute platform (a minimal sketch is shown after these steps).
Trigger the function with Cloud Scheduler [3] using a schedule interval of your choice. The interval is specified as a cron expression, and the scheduler triggers the function via a REST call to the function's endpoint.
Verify the success of the above operation by visiting your bucket and confirming that the table(s) have been exported.
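For illustration, here is a minimal sketch of such an export function in Python, assuming the google-cloud-bigquery client library. The entry point, table, bucket, and format choices are placeholders, not a definitive implementation:

from google.cloud import bigquery

def export_table_to_gcs(request):
    """HTTP-triggered Cloud Function (placeholder name) that exports a table to GCS.

    Cloud Scheduler would call this endpoint on the chosen cron schedule.
    """
    client = bigquery.Client()

    # Placeholder identifiers -- replace with your own project, dataset, table and bucket.
    source_table = "my-project.my-dataset.my-table"
    destination_uri = "gs://my-bucket/exports/my-table-*.csv"

    # Configure the extract job; CSV with gzip compression is just one option.
    job_config = bigquery.job.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        compression=bigquery.Compression.GZIP,
    )

    extract_job = client.extract_table(
        source_table,
        destination_uri,
        job_config=job_config,
    )
    extract_job.result()  # Wait for the export to finish.

    return f"Exported {source_table} to {destination_uri}"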
Related
I have linked Firebase events to BigQuery and my goal is to pull the events into S3 from BigQuery using AWS Glue.
When you link Firebase to BigQuery, it creates a default dataset with date-sharded event tables, something like this:
analytics_456985675.events_20230101
analytics_456985675.events_20230102
I'm used to querying the events in BigQuery using:
SELECT
  ...
FROM `analytics_456985675.events_*`
WHERE date >= [date]
However, when configuring the Glue ETL job, it refuses to accept the wildcard table format analytics_456985675.events_* and returns an error message; it seems the Glue job will only work when I specify a single table.
How can I create a Glue ETL job that pulls data from BigQuery incrementally if I have to specify a single partition table?
I have a Dataflow pipeline which reads messages from Pub/Sub Lite and streams the data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
the BigQuery UI tells me "This query will process 1.9 GB when run", but when actually running the query I don't get any results. My pipeline has been running for a whole week now, and for the last two days I have been getting the same empty results. However, for 2021-10-11 and the days before that I do see actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    // Schema for the destination table.
    .withSchema(createTableSchema())
    // Convert each Event into a TableRow.
    .withFormatFunction(event -> createTableRow(event))
    // The table already exists; never create it.
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    // Append to the existing data.
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    // Partition by day on the "timestamp" column.
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me that data is available?
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow answer and in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
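If you want to confirm that recent rows are still sitting in the streaming buffer, one option is to inspect the table's streamingBuffer statistics via the API. A minimal sketch with the google-cloud-bigquery Python client (the table id is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table id -- replace with your own.
table = client.get_table("my-project.my-dataset.my-table")

# streaming_buffer is None once all streamed rows have been committed to partitions.
buf = table.streaming_buffer
if buf is None:
    print("No rows pending in the streaming buffer.")
else:
    print(f"~{buf.estimated_rows} rows (~{buf.estimated_bytes} bytes) still buffered, "
          f"oldest entry at {buf.oldest_entry_time}.")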
If you are having problems writing the data from Pub/Sub to BigQuery, I recommend using one of the templates available in Dataflow.
Use a Dataflow template available in GCP to write the data from Pub/Sub to BigQuery:
There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job from the template (a programmatic sketch of the launch is shown below):
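If you would rather launch the template programmatically than through the console, here is a rough sketch using the Dataflow REST API via the Python API client. The template path and parameter names are my assumptions for the public Pub/Sub-to-BigQuery template, so double-check them against the template's documentation; the project, topic, table, and bucket names are placeholders:

from googleapiclient.discovery import build

# Sketch: launch the public Pub/Sub-to-BigQuery template from Python.
# Assumes Application Default Credentials; all names below are placeholders.
dataflow = build("dataflow", "v1b3")

request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    # Assumed path of the public template; verify in the Dataflow template docs.
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    body={
        "jobName": "pubsub-to-bigquery-test",
        "parameters": {
            "inputTopic": "projects/my-project/topics/test-topic",
            "outputTableSpec": "my-project:my_dataset.output_table",
        },
        "environment": {
            "tempLocation": "gs://my-temp-bucket/tmp",
        },
    },
)
response = request.execute()
print("Launched job:", response["job"]["id"])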
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template's code from GitHub and adjust it to your needs.
Here's the case:
Our client uploads CSVs daily (overwriting the previous ones) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as a data source in Data Studio.
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we've:
Declared the tables in BigQuery with the "Overwrite table" write preference option
Configured the daily transfers via the UI (BigQuery > Transfers) to automatically load the CSVs from Google Cloud Storage one hour after the files are uploaded, as stated by the limitations.
The automated transfer/load defaults to "WRITE_APPEND", so the tables are appended to instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table contents just before the automated import process starts:
DELETE FROM project.tableName WHERE true
The new data is then imported into an empty table, so the default "WRITE_APPEND" behaviour doesn't affect us.
1) One way to do this is to use a DDL statement to CREATE OR REPLACE your table before the import runs.
This is an example of how to create a table:
#standardSQL
CREATE TABLE mydataset.top_words
OPTIONS(
  description="Top ten words per Shakespeare corpus"
) AS
SELECT
  corpus,
  ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Now that it's created you can import your data.
2) Another way is to use BigQuery scheduled queries.
3) If you write Python, you can find an even better solution here.
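As an illustration of what such a Python solution can look like, here is a minimal sketch of a load job that overwrites the table on every run, using the google-cloud-bigquery client. The table id, CSV location, and schema options are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers -- replace with your own table and CSV location.
table_id = "my-project.my_dataset.my_table"
source_uri = "gs://my-bucket/my_table.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # Overwrite the table on every run instead of appending.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load to complete.
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}.")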
I have a Cloud Function which uses the Dataflow API to create a new job from a template I created using DataPrep. The recipe basically cleans up some JSON objects, turns them into CSV format, and adds a timestamp column before loading everything into a BigQuery table. The main idea is to take a snapshot of certain information from our platform.
I managed to run the job from the Dataflow API and the data is correctly inserted into the BigQuery table. However, the value of the timestamp field is always the same: it corresponds to the execution time of the job from which I created the template (the DataPrep template). When I run the job from the DataPrep interface the timestamp is inserted correctly, but it does not change when I execute the job with the same template from the Cloud Function.
The snippet of code which calls the Dataflow API:
dataflow.projects.templates.launch({
  projectId: projectId,
  location: location,
  gcsPath: jobTemplateUrl,
  resource: {
    parameters: {
      inputLocations: `{"location1": "gs://${file.bucket}/${file.name}"}`,
      outputLocations: `{"location1": "${location2}"}`,
      customGcsTempLocation: `gs://${destination.bucket}/${destination.tempFolder}`
    },
    environment: {
      tempLocation: `gs://${destination.bucket}/${destination.tempFolder}`,
      zone: "us-central1-f"
    },
    jobName: 'user-global-stats-flow',
  }
});
This is the Dataflow execution console snapshot; as can be seen, the latest jobs are the ones executed from the Cloud Function, while the one at the bottom was executed from the DataPrep interface:
Dataflow console snapshot
This is the part of the recipe in charge of creating the timestamp:
Dataprep recipe sample
Finally, this is what is inserted into the BigQuery table, where the first insertion with the same timestamp (row 4) corresponds to the job executed from DataPrep, and the rest are the executions from the Cloud Function with the Dataflow API:
Big Query Insertions
So the question is whether there is a way to make the timestamp resolve at job execution time, because right now it looks like it is fixed in the template's recipe.
Thanks for your help in advance.
If I understand correctly, this is documented behaviour. From the list of known limitations when running a Dataprep template through Dataflow:
All relative functions are computed based on the moment of execution. Functions such as NOW() and TODAY are not recomputed when the Cloud Dataflow template is executed.
I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as JSON to a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that will have lots of users hitting it, and I'd rather not be in the business of managing all these temporary tables.
1) The steps you mention are correct. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first export to GCS and then transfer to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to run the chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
Please also note that you can update the table via the API and set the expirationTime property; with this approach you don't need to delete it explicitly.
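To make that chain of operations concrete, here is a rough sketch with the google-cloud-bigquery Python client: run the query into a named destination table, give it an expiration instead of deleting it, and extract it to GCS as newline-delimited JSON. All table, dataset, and bucket names are placeholders:

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names -- replace with your own dataset, table and bucket.
temp_table_id = "my-project.my_dataset.tmp_export_results"
destination_uri = "gs://my-bucket/exports/results-*.json"

# 1. Run the query into a named destination table.
query_config = bigquery.QueryJobConfig(
    destination=temp_table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    "SELECT * FROM `my-project.my_dataset.big_table`", job_config=query_config
).result()

# 2. Give the temporary table an expiration instead of deleting it explicitly.
table = client.get_table(temp_table_id)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])

# 3. Extract the table to GCS as newline-delimited JSON.
extract_config = bigquery.job.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(temp_table_id, destination_uri, job_config=extract_config).result()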
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve a local export, but it has certain other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
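A rough Python equivalent of that local export, assuming the google-cloud-bigquery client (the query and output file name are placeholders):

import json
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder query -- the same public sample data as the bq example above.
rows = client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000"
).result()

# Write one JSON object per line (newline-delimited JSON).
with open("export.json", "w") as f:
    for row in rows:
        f.write(json.dumps(dict(row), default=str) + "\n")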