Is it possible to export all BigQuery scheduled queries listed into a CSV? - google-bigquery

I need a programmatic approach to get the list of BigQuery scheduled queries exported to a CSV (or a Google Sheet / BigQuery table if possible), but I cannot find any related documentation. For now I can only select all the text manually from the BigQuery scheduled queries page.
Below is the information needed:
Display name and its URL
Schedule (UTC)
Next scheduled
Author
Destination dataset and destination table
But with new scheduled queries still being created, it is getting harder to keep track as the list keeps growing.

To list the scheduled queries with the bq CLI:
bq ls --transfer_config --transfer_location=US --format=prettyjson
To view the details of a scheduled query:
bq show --transfer_config [RESOURCE_NAME]
# [RESOURCE_NAME] is the value from the above bq ls command
In Python you can use the code below to list the transfer configurations in a project.
from google.cloud import bigquery_datatransfer
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "my-project"
parent = transfer_client.common_project_path(project_id)
configs = transfer_client.list_transfer_configs(parent=parent)
print("Got the following configs:")
for config in configs:
    print(f"\tID: {config.name}, Schedule: {config.schedule}")
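Building on that snippet, here is a minimal sketch of writing the configs to a CSV file; the project ID and output filename are placeholders, and the fields used (display_name, schedule, next_run_time, destination_dataset_id) are the TransferConfig fields that map to the columns asked for above.
import csv
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")  # placeholder project ID

with open("scheduled_queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["resource_name", "display_name", "schedule", "next_run_time", "destination_dataset"])
    for config in transfer_client.list_transfer_configs(parent=parent):
        writer.writerow([
            config.name,                    # full resource name, usable to build the console URL
            config.display_name,
            config.schedule,                # e.g. "every 24 hours"
            config.next_run_time,           # timestamp of the next scheduled run
            config.destination_dataset_id,  # empty for transfers that are not scheduled queries
        ])
        # The destination table of a scheduled query typically lives in config.params
        # (key "destination_table_name_template"); verify against your own configs.
The resulting CSV can then be uploaded to a Google Sheet or loaded into a BigQuery table.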
For more information you can refer to link1 and link2.

Related

Adding Labels to a BigQuery Table from a PySpark Job on Dataproc Using the Spark BigQuery Connector

I am trying to use PySpark on a Google Dataproc cluster to run a Spark job and write the results to a BigQuery table.
Spark Bigquery Connector Documentation - https://github.com/GoogleCloudDataproc/spark-bigquery-connector
The requirement is that, when the table is created, certain labels should be present on the BigQuery table.
The Spark BigQuery connector does not provide any provision to add labels during the write operation:
df.write.format("bigquery") \
.mode("overwrite") \
.option("temporaryGcsBucket", "tempdataprocbqpath") \
.option("createDisposition", "CREATE_IF_NEEDED") \
.save("abc.tg_dataset_1.test_table_with_labels")
The above command creates a BigQuery load job in the background that loads the table with the data.
On checking further, the BigQuery load job syntax itself does not support adding labels, in contrast to the BigQuery query job.
Is there any plan to support the below?
Support for labels in the BigQuery load job
Support for labels in the write operation of the Spark BigQuery connector
Since there is no provision to add labels during the load/write operation, the current workaround is to create the table with the schema and labels before running the PySpark job.
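A minimal sketch of that workaround with the google-cloud-bigquery client, assuming the project/dataset/table from the save() path above and an illustrative two-column schema (replace both with what your job actually writes):
from google.cloud import bigquery

client = bigquery.Client(project="abc")  # project taken from the example path; adjust as needed

schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("value", "INT64"),
]

table = bigquery.Table("abc.tg_dataset_1.test_table_with_labels", schema=schema)
table.labels = {"team": "analytics", "env": "prod"}  # example labels

# Create the labelled table before the PySpark job writes into it.
client.create_table(table, exists_ok=True)
The Spark write can then target the pre-created, labelled table.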

Dataprep is leaving Datasets/Tables behind in BigQuery

I am using Google Cloud Dataprep for processing data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".
It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is an issue, as BigQuery becomes messy very quickly with all these temporary datasets.
This has not been an issue in the past.
A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609
Is there any way of not creating temporary datasets, or instead creating them in a Cloud Storage folder?
I wrote a cleanup script that I am running in Cloud Run (see this article) using Cloud Scheduler.
Below is the script:
#!/bin/bash
PROJECT={PROJECT_NAME}
# get list of datasets with temp_dataset_beam
# optional: write list of files to cloud storage
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
bq ls --max_results=100 | grep "temp_dataset_beam" | gsutil -q cp -J - "${obj}"
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")
for dataset in $datasets
do
echo $PROJECT:$dataset
# WARNING: Uncomment the line below to remove datasets
# bq rm --dataset=true --force=true $PROJECT:$dataset
done
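If you would rather do the cleanup in Python (for example from the same Cloud Run job), a rough equivalent using the BigQuery client library could look like this; the temp_dataset_beam prefix and the project ID are assumptions carried over from the script above, and the delete call is commented out just like the bq rm line:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

for dataset in client.list_datasets():
    if dataset.dataset_id.startswith("temp_dataset_beam"):
        print(f"{client.project}:{dataset.dataset_id}")
        # WARNING: uncomment to actually remove the dataset and everything in it
        # client.delete_dataset(dataset.dataset_id, delete_contents=True, not_found_ok=True)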
I solved this in Dataprep directly by running a SQL script post data publish that runs after each job. You can set this in Dataprep in the output's Manual Settings.
FOR drop_statement IN (
  SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
  FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
  WHERE table_name LIKE "Dataprep_%"
  ORDER BY table_name DESC
)
DO
  EXECUTE IMMEDIATE(drop_statement.value); -- here the table is dropped
END FOR;

How does BigQuery use data stored in Google Cloud?

Guys, a very basic question, but I am not able to figure it out. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage bucket?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory is gs://sp2040/raw/cards/cust/ for the customer files. The table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, ... to load a new dataset. So after the data is loaded into this bucket, do I need to fire the commands below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage bucket?
Nope! BigQuery does not use Cloud Storage for storing its data (unless it is a federated table linked to Cloud Storage).
Check out BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it.
Do I need to fire the commands below?
Yes, you need to load those files into BigQuery so you can query the data.
Yes, you would need to load the data into BigQuery using those commands.
However, there are a couple of alternatives:
Pub/Sub and Dataflow: You could configure Pub/Sub to watch your Cloud Storage bucket and create a notification when files are added, as described here. You could then have a Dataflow job that imports the files into BigQuery. DataFlow Documentation
BigQuery external tables: BigQuery can query CSV files that are stored in Cloud Storage without importing the data, as described here (a sketch follows below). There is wildcard support for filenames, so it could be configured once. Performance might not be as good as storing the data directly in BigQuery.
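A rough sketch of such an external table with the google-cloud-bigquery client, assuming the bucket layout from the question and the custid/grp/odate schema defined above (the project and table names are illustrative):
from google.cloud import bigquery

client = bigquery.Client()

# External (federated) table over the daily CSV files; the data stays in Cloud Storage.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://sp2040/raw/cards/cust/*"]  # wildcard across the daily folders
external_config.schema = [
    bigquery.SchemaField("custid", "STRING"),
    bigquery.SchemaField("grp", "INTEGER"),
    bigquery.SchemaField("odate", "STRING"),
]
external_config.options.skip_leading_rows = 0  # adjust if the files have a header row

table = bigquery.Table("my-project.market.cust_external")  # illustrative table ID
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)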

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to load them into a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square one - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer, so if there's a simpler way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and write/append to BigQuery (sink).
Your pipeline can remove duplicates itself. See here and here.
Create a Cloud Function to monitor your GCS bucket.
When a new file arrives, your Cloud Function is triggered automatically; it launches your Dataflow pipeline, which reads the new file, dedupes it, and writes the results to BigQuery.
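A rough sketch of that trigger as a first-generation Python Cloud Function that launches a templated Dataflow pipeline; the project, region, template path and the inputFile parameter name are all assumptions and have to match whatever your template actually defines:
from googleapiclient.discovery import build

def launch_dedupe_pipeline(event, context):
    """Triggered by a new object in the GCS bucket; launches the Dataflow template."""
    input_file = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",                     # placeholder project ID
        location="us-central1",                     # placeholder region
        gcsPath="gs://my-bucket/templates/dedupe",  # placeholder template path
        body={
            "jobName": "dedupe-" + event["name"].replace("/", "-").replace(".", "-").lower(),
            "parameters": {"inputFile": input_file},  # parameter name defined by your template
        },
    )
    print(request.execute())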
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable, which is used to find yesterday's data files in GCS and load them into a table, from which a de-duped table (via grouping by all columns) is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are shown because the script is saved as a .CMD file and run through Windows Task Scheduler.
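For anyone who would rather skip the VBScript and batch substring juggling, the same daily flow can be sketched with the BigQuery Python client; the dataset, table names, bucket path and two-column schema are taken from the commands above, and the rest is an assumption about how you would wire it together (for example in a small script run by cron or Cloud Scheduler):
from datetime import date, timedelta

from google.cloud import bigquery

client = bigquery.Client()
project = client.project
yday = (date.today() - timedelta(days=1)).strftime("%Y%m%d")  # yesterday as YYYYMMDD
raw_table = f"{project}.data.data_{yday}_1"
deduped_table = f"{project}.data.data_{yday}_2"

# 1. Load yesterday's CSVs into a temporary table.
load_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),
        bigquery.SchemaField("TransactionID", "STRING"),
    ],
)
uri = f"gs://mybucket/data/{yday[:4]}-{yday[4:6]}-{yday[6:]}*.csv.gz"
client.load_table_from_uri(uri, raw_table, job_config=load_config).result()

# 2. De-dupe into a second temporary table (group by all columns).
query_config = bigquery.QueryJobConfig(destination=deduped_table)
client.query(f"SELECT * FROM `{raw_table}` GROUP BY 1, 2", job_config=query_config).result()

# 3. Append the de-duped rows to the main table, then clean up.
copy_config = bigquery.CopyJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
client.copy_table(deduped_table, f"{project}.data.data", job_config=copy_config).result()
client.delete_table(raw_table)
client.delete_table(deduped_table)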

BigQuery bq command - load only if table is empty or doesn't exist

I'm executing a load command with bq, e.g.:
bq load ds.table gs://mybucket/data.csv dt:TIMESTAMP,f1:INTEGER
I would like to load the data only if the table is empty or doesn't exist.
Is it possible?
EDIT:
Basically I would like the WRITE_EMPTY API option via the bq command line tool:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.writeDisposition
If the table already exists and contains data, a 'duplicate' error is returned in the job result.
If you check bq.py, which contains the source code for the BigQuery CLI, you'll find that the _Load() method doesn't implement an option for the WRITE_EMPTY API option; it's either the default WRITE_APPEND or the optional WRITE_TRUNCATE.
As you indicate, the API does support WRITE_EMPTY - if you want to see this as an option on the CLI, you can submit a feature request at https://code.google.com/p/google-bigquery/issues/list?q=label:Feature-Request
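Until that lands, one workaround is to go through the Python client library, which does expose WRITE_EMPTY; a minimal sketch, assuming the same bucket, schema and destination table as the bq load example above:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("dt", "TIMESTAMP"),
        bigquery.SchemaField("f1", "INTEGER"),
    ],
    # Only write if the destination table is empty or does not exist yet.
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/data.csv",
    "ds.table",  # resolved against the client's default project
    job_config=job_config,
)
load_job.result()  # raises if the job fails, e.g. because the table already contains data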
You can use the BQ command-line tool.
Get Table Information
bq show <project_id>:<dataset_id>.<table_id>
List tables
bq ls [project_id:][dataset_id]