Schedule a nightly table copy - google-bigquery

Is it possible to create a scheduled job / process to copy a BigQuery table each night? I am trying to create automated nightly table backups, and I haven't seen any examples of how to accomplish this.
Any help would be greatly appreciated.
Eric

You can use the bq command-line tool to submit a batch query job and schedule it with cron (or Task Scheduler on Windows). The command will look similar to this:
bq --nosync query --batch --allow_large_results --nouse_legacy_sql --replace --destination_table dataset.backup_table "select * from dataset.table"
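If you would rather drive the nightly copy from a script than from a one-line cron entry, here is a minimal Python sketch of the same idea using the google-cloud-bigquery client (the project and table names below are placeholders); it can be kicked off by cron or Task Scheduler just the same:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Overwrite the backup table with a fresh copy of the source table each night.
job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
copy_job = client.copy_table(
    "my-project.dataset.table",         # source table (placeholder)
    "my-project.dataset.backup_table",  # backup destination (placeholder)
    job_config=job_config,
)
copy_job.result()  # block until the copy job finishes
print("Backup copy completed:", copy_job.job_id)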

Related

Is it possible to export a list of all BigQuery scheduled queries into a CSV?

I need a programmatic way to get an up-to-date list of the BigQuery scheduled queries into a CSV (or a Google Sheet / BigQuery table if possible), but I cannot find any related documentation. For now I can only select all the text manually from the BigQuery scheduled queries page.
Below is the information I need:
Display name and its URL
Schedule (UTC)
Next scheduled
Author
Destination dataset and destination table
But with new scheduled queries still being created, it is getting harder to keep track of them as the list keeps growing.
To list the scheduled queries with the bq CLI:
bq ls --transfer_config --transfer_location=US --format=prettyjson
To view the details of a scheduled query:
bq show --transfer_config [RESOURCE_NAME]
# [RESOURCE_NAME] is the value from the above bq ls command
In Python, you can use the code below to list the transfer configurations in a project.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

project_id = "my-project"
parent = transfer_client.common_project_path(project_id)

# List every transfer configuration (scheduled queries included) in the project.
configs = transfer_client.list_transfer_configs(parent=parent)

print("Got the following configs:")
for config in configs:
    print(f"\tID: {config.name}, Schedule: {config.schedule}")
For more information you can refer to link1 and link2.

BigQuery - scheduled query through CLI

Simple question regarding the bq CLI tool. I am fairly confident the answer is, as of the writing of this question, no, but I may be wrong.
Is it possible to create a scheduled query (similar to the one seen in the screenshot below) using the bq CLI tool?
Yes, scheduled queries can now be created with bq mk --transfer_config. Please see the examples below:
To create a scheduled query that runs SELECT 1:
bq mk --transfer_config --target_dataset=mydataset --display_name='My Scheduled Query' --schedule='every 24 hours' --params='{"query":"SELECT 1","destination_table_name_template":"mytable","write_disposition":"WRITE_TRUNCATE"}' --data_source=scheduled_query
Note:
--target_dataset is required.
--display_name is required.
In the --params field, query is required, and only standard SQL queries are supported.
In the --params field, destination_table_name_template is optional for DML and DDL but required for regular SELECT queries.
In the --params field, write_disposition behaves the same way as destination_table_name_template: required for regular SELECT queries but optional for DML and DDL.
--data_source must always be set to scheduled_query to create a scheduled query.
After a scheduled query is created successfully, you will get back its full resource name, for example:
Transfer configuration 'projects/<p>/locations/<l>/transferConfigs/5d1bec8c-0000-2e6a-a4eb-089e08248b78' successfully created.
For example, to schedule a backfill for this scheduled query:
bq mk --transfer_run --start_time 2017-05-25T00:00:00Z --end_time 2017-05-25T00:00:00Z projects/<p>/locations/<l>/transferConfigs/5d1bec8c-0000-2e6a-a4eb-089e08248b78
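If you prefer a client library over the bq CLI, the same scheduled query can also be created from Python; a rough sketch with the google-cloud-bigquery-datatransfer package (project, dataset, and table names are the same placeholders as above) looks like this:
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")  # placeholder project ID

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="mydataset",
    display_name="My Scheduled Query",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT 1",
        "destination_table_name_template": "mytable",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)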
Hope this helps! Thank you for using scheduled queries!

How do you run a BigQuery query and store the results in a table on a regular basis

I have a BigQuery view which takes about 30 seconds to run. I want to run the view once a day at a designated time and store the results in a materialized table (e.g. so that Data Studio dashboards can use the table without making the dashboard take 30 seconds to load).
Is there a built-in way to do this using a tool like Dataproc, or do you just have to set up a cron job that runs
CREATE TABLE dataset.materialized_view AS
SELECT *
FROM dataset.view;
on a regular basis?
You can achieve this using scheduled queries.
In the Classic BigQuery UI (Cloud Console UI support is under development at the time of this writing), write the query that you want to run in the "Compose Query" text area, then click the "Schedule Query" button. From the panel that appears, you can choose the frequency with which to run the query; the default is every 24 hours.
Alternatively, you can set up a regular cron job which runs a query to read data from your view and write it to a destination table. Based on your example, something like:
bq --location=[LOCATION] query -n 0 --destination_table dataset.materialized_view --use_legacy_sql=false --replace=true 'select * from dataset.view'
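If you would rather run that refresh from your own code than from the bq CLI, a roughly equivalent Python sketch (the project and table names are placeholders) would be:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Rewrite the materialized table from the view, equivalent to --replace above.
job_config = bigquery.QueryJobConfig(
    destination="my-project.dataset.materialized_view",  # placeholder destination table
    write_disposition="WRITE_TRUNCATE",
)
query_job = client.query("SELECT * FROM dataset.view", job_config=job_config)
query_job.result()  # wait for the query to finish
print("Refreshed table:", query_job.destination.table_id)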

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud Storage to store data, and several times throughout the day data is uploaded there as CSVs. I use BigQuery to run jobs that populate BigQuery tables with that data.
For reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL job to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to load them into a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Append the de-duped temp table to the main BigQuery table.
Delete the temp table
So basically I'm stuck at square one: I don't know how to do any of this in an automated fashion. I know BigQuery has an API, there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else might have had experience with this and could give me some hints. Like I said, I'm not a developer, so if there's a simpler way to accomplish this, that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and append to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a Cloud Function to monitor your GCS bucket.
When a new file arrives, your Cloud Function is triggered automatically and launches your Dataflow pipeline, which reads the new file, dedupes it, and writes the results to BigQuery (a rough sketch of such a function follows below).
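A very rough sketch of that Cloud Function (Python runtime) might look like the following, assuming you have already built and staged your Dataflow template somewhere in GCS; the project, region, template path, and parameter name are all placeholders for whatever your own pipeline expects:
import time

from googleapiclient.discovery import build

def on_new_file(event, context):
    """Background Cloud Function triggered when a new object lands in the GCS bucket."""
    input_file = "gs://{}/{}".format(event["bucket"], event["name"])

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId="my-project",                     # placeholder project ID
        location="us-central1",                     # placeholder region
        gcsPath="gs://my-bucket/templates/dedupe",  # placeholder path to your staged template
        body={
            "jobName": "dedupe-{}".format(int(time.time())),  # any valid, reasonably unique job name
            "parameters": {"inputFile": input_file},          # placeholder parameter name
        },
    ).execute()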
So, no offense to Graham Polley, but I ended up using a different approach. Thanks to these pages (and a TON of random batch-file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VBScript called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable, which is used to find yesterday's data files in GCS and load them into a table, from which a de-duped table (via grouping by all columns) is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are there because it's saved as a .CMD file and run through Windows Task Scheduler.
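For anyone who would rather not maintain a batch file, here is a hypothetical Python sketch of the same flow (load yesterday's CSVs into a temp table, de-dupe into a second temp table using SELECT DISTINCT instead of grouping by every column, append to the main table, then drop both temp tables); the project, dataset, and bucket names mirror the placeholders above:
import datetime

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID
yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime("%Y%m%d")
temp_raw = f"my-project.data.data_{yesterday}_1"      # raw load target (placeholder)
temp_deduped = f"my-project.data.data_{yesterday}_2"  # de-duped table (placeholder)

# 1. Load yesterday's CSVs from GCS into a temporary table.
load_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),
        bigquery.SchemaField("TransactionID", "STRING"),
    ],
)
uri = f"gs://mybucket/data/{yesterday[:4]}-{yesterday[4:6]}-{yesterday[6:]}*.csv.gz"
client.load_table_from_uri(uri, temp_raw, job_config=load_config).result()

# 2. De-dupe into a second temporary table.
dedupe_config = bigquery.QueryJobConfig(destination=temp_deduped)
client.query(f"SELECT DISTINCT * FROM `{temp_raw}`", job_config=dedupe_config).result()

# 3. Append the de-duped rows to the main table.
copy_config = bigquery.CopyJobConfig(write_disposition="WRITE_APPEND")
client.copy_table(temp_deduped, "my-project.data.data", job_config=copy_config).result()

# 4. Drop both temporary tables.
client.delete_table(temp_raw)
client.delete_table(temp_deduped)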

Use Bigquery API or bq command line tool to create new table

I am trying to come up with a programmatic way to generate a new BigQuery table from a pre-existing table. I know how to create a new view using the bq command-line tool:
bq --project_id='myProject' mk --view="SELECT * FROM [MYDATASET.database]" myProject.newTable
But that creates a view, which doesn't help me; I need to create a table for a bunch of reasons.
I would be happy to create a view and then generate a new table from that view periodically, but I can't figure out how to do that without doing it manually through the BigQuery web interface.
I'd appreciate any help.
Brad
If it's a normal table (not a view), you can use the copy command:
bq cp <source> <destination>
If you're trying to materialize a view, or if you need to modify the contents of the table in the process (e.g., adding/removing/transforming fields), you can run a query with a destination table:
bq query \
--destination_table=<destination> \
--allow_large_results \
--noflatten_results \
'SELECT ... FROM <source>'
The query option is more powerful, but you'll get charged for running the query. The copy is free.