BigQuery - Scheduled Query Update Delete via CLI - google-bigquery

BigQuery's scheduled queries are an interesting feature, and the fact that GCP lets you create one programmatically via its CLI tool offers real flexibility and convenience. After creating a few scheduled queries, I realised that from time to time one or several of them need to be updated.
The challenge is finding a way to update, delete, or enable/disable an existing scheduled query via the CLI. GCP's documentation clearly explains how to create one using either the bq query or bq mk command, but nothing suggests that an existing scheduled query can be deleted or modified, whether via the CLI or Python.
I was thinking of bq rm, but there doesn't seem to be a flag that specifically deletes a scheduled query, and I risk dropping an entire dataset or table.
Perhaps it is a limitation at the moment. However, if anyone has found a way to do this, or a workaround, please share it.

Sorry for the confusion.
Scheduled queries do support update and delete in the CLI. A scheduled query is managed as a transfer config in the CLI; please see update transfer config and delete transfer config.
For example, to update/delete a scheduled query with name projects/p/locations/us/transferConfigs/scheduled_query:
# Update the query parameter in a scheduled query.
bq update --transfer_config --params='{"query":"SELECT 1"}' projects/p/locations/us/transferConfigs/scheduled_query
# Delete a scheduled query.
bq rm --transfer_config projects/p/locations/us/transferConfigs/scheduled_query
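If you don't know the full resource name, you can list your transfer configs first; the same update command can also change other properties of the scheduled query. A rough sketch (the location, display name, and schedule values below are just examples):
# List transfer configs (scheduled queries included) to find the full resource name.
bq ls --transfer_config --transfer_location=us --format=prettyjson
# Change the display name and schedule of an existing scheduled query.
bq update --transfer_config --display_name='daily_report' --schedule='every 24 hours' projects/p/locations/us/transferConfigs/scheduled_query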
Hope this helps!

Related

Running BQ Load command using Google cloud scheduler

I would like to run bq load command once every day at 00:00 UTC. Can I use Google Cloud scheduler for scheduling this command?
As mentioned by #Daniel, there is no direct way to use Cloud Scheduler to execute scheduled queries; however, there are options you can consider to run queries on a schedule:
1. Use scheduled queries directly in BQ. Since your source is GCS, you can load data from GCS to BQ and then execute scheduled queries as mentioned here.
2. Use a scheduled Cloud Function to run queries (a sketch of this option follows after this list).
3. Schedule using Data Transfer.
You can also try what #Graham Polley has mentioned in this blog, which requires an architecture combining Cloud Scheduler, Cloud Source Repositories and Cloud Build.
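For the Cloud Scheduler + Cloud Function option, the wiring could look roughly like this (a sketch only; the function name, region, schedule and URL are hypothetical, and the function code that runs the query/load is not shown):
# Deploy an HTTP-triggered Cloud Function that runs the bq load or query.
gcloud functions deploy run_bq_load --runtime=python39 --trigger-http --region=us-central1
# Invoke it every day at 00:00 UTC via Cloud Scheduler.
gcloud scheduler jobs create http daily-bq-load --schedule="0 0 * * *" --time-zone="UTC" --uri="https://us-central1-my-project.cloudfunctions.net/run_bq_load" --http-method=POST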
Assuming you have a file that is loaded into Cloud Storage every day before 7am, you may want to consider a more resilient design: when the file is created in Cloud Storage, a notification kicks off the process that loads it. This gets the data into BigQuery earlier and keeps working even if the file creation is delayed.
When the file is created in Cloud Storage, a message is published to Pub/Sub: https://cloud.google.com/storage/docs/pubsub-notifications
Then a Cloud Function is invoked that executes the bq load command.
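The setup could look roughly like this (a sketch; the bucket, topic, dataset and table names are hypothetical):
# Publish a Pub/Sub message whenever a new object is finalized in the bucket.
gsutil notification create -t bq-load-topic -f json -e OBJECT_FINALIZE gs://my-incoming-bucket
# Inside the Cloud Function (or any worker subscribed to the topic), run the equivalent of:
bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://my-incoming-bucket/myfile.csv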
BTW if you have many files or even some dependencies, consider using Cloud Composer as an orchestrator to keep its complexity under control.
You would not be able to do it directly with Cloud Scheduler; you would need an intermediary like a Cloud Function to execute the command. Alternatively, you could try scheduling a data transfer, depending on the requirements of your load job.
Here is an example from the documentation:
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
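Creating such a Cloud Storage transfer from the CLI could look roughly like this (a sketch; the dataset, table template, path and format values are hypothetical):
# Create a Cloud Storage -> BigQuery transfer config that loads matching files on a schedule.
bq mk --transfer_config --data_source=google_cloud_storage --target_dataset=mydataset --display_name='daily_gcs_load' --params='{"data_path_template":"gs://my-bucket/daily/*.csv","destination_table_name_template":"mytable","file_format":"CSV","skip_leading_rows":"1"}'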
Based on your update that you want to shard the table by date, try scheduled queries in the following manner.
Create an external table pointing to the desired path in GCS, as described here.
Define your query; I recommend selecting explicit column names with appropriate casting.
SELECT *
FROM myproject.dataset_id.external_table_name
-- INCLUDE FILTERING ON _FILE_NAME IF NEEDED LIKE FOLLOWING:
-- WHERE _FILE_NAME LIKE SOME_VALUE
Create a scheduled query with the run_date parameter in the destination table name, e.g. new_table_{run_date} (a CLI sketch follows below).
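From the CLI, the whole setup could look roughly like this (a sketch only; the bucket path, dataset, table and query values are hypothetical):
# Build an external table definition over the GCS path and create the external table.
bq mkdef --autodetect --source_format=CSV "gs://my-bucket/path/*.csv" > /tmp/ext_def.json
bq mk --external_table_definition=/tmp/ext_def.json dataset_id.external_table_name
# Create the scheduled query with a {run_date} template in the destination table name.
bq mk --transfer_config --data_source=scheduled_query --target_dataset=dataset_id --display_name='daily_shard' --params='{"query":"SELECT * FROM myproject.dataset_id.external_table_name","destination_table_name_template":"new_table_{run_date}","write_disposition":"WRITE_TRUNCATE"}'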

How to disable Scheduled queries in Big Query Console

I have close to 1,000 scheduled queries in BigQuery. However, I now have to pause all of them. One approach is to manually disable them one after the other, but that would be an exhausting task. Is there any way to disable all the active scheduled queries in one go?
Disabling scheduled queries (transfer) is not supported by the CLI, but it can be done using Console/API:
Document Link:
https://cloud.google.com/bigquery-transfer/docs/working-with-transfers
Disabling a transfer using the API:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/patch
List transfer config names:
bq ls --transfer_config --transfer_location=US --format=prettyjson | jq '.[].name'
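Combining that listing with the patch API, a bulk disable could look roughly like this (a sketch only; it assumes jq and gcloud are available and that every listed config should be disabled):
# Loop over every transfer config in the location and disable it via the Data Transfer API.
for NAME in $(bq ls --transfer_config --transfer_location=US --format=prettyjson | jq -r '.[].name'); do
  curl -s -X PATCH -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" -d '{"disabled": true}' "https://bigquerydatatransfer.googleapis.com/v1/${NAME}?updateMask=disabled"
done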

Different ways of updating bigquery table

In GCP, I need to update a BigQuery table whenever a file (in one of several formats, such as JSON or XML) gets uploaded to a bucket. I have two options but I'm not sure about the pros/cons of each. Can someone suggest which is the better solution and why?
Approach 1 :
File uploaded to bucket --> Trigger Cloud Function (which updates the bigquery table) -->Bigquery
Approach 2:
File uploaded to bucket --> Trigger Cloud Function (which triggers a dataflow job) -->Dataflow-->Bigquery.
In a production environment, which approach is better suited and why? If there are alternative approaches, please let me know.
This is quite a broad question, so I wouldn't be surprised if it gets voted to be closed. That said, I'd always go with #2 (GCS -> Cloud Function -> Dataflow -> BigQuery).
Remember, Cloud Functions have a maximum execution time. If you kick off a load job from the Cloud Function, you'll need to bake logic into it to poll and check the job status (load jobs in BigQuery are asynchronous). If the job fails, you'll need to handle that. And what if the job is still running when you hit the max execution time of your Cloud Function?
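At the CLI, the same asynchronous pattern looks roughly like this (a sketch only; the job ID, dataset, table and file are hypothetical):
# Start the load job without waiting for it to finish, giving it a known job ID.
bq --nosync --job_id=my_load_job_001 load --source_format=CSV mydataset.mytable gs://my-bucket/file.csv
# Poll: wait up to 300 seconds for the job to finish, then inspect its final state.
bq wait my_load_job_001 300
bq show --format=prettyjson -j my_load_job_001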
At least by using Dataflow, you don't have the problem of max execution times and you can simply rerun your pipeline if it fails for some transient reason e.g. network issues.

Automate the process of Pig, Hive, Sqoop

I have data in HDFS (Azure HDInsight) in CSV format. I am using Pig to process this data. After the Pig processing, the summarized data will be stored in Hive, and then the Hive table is exported to an RDBMS using Sqoop. Now I need to automate this whole process. Is it possible to write a method for each of these three tasks in MapReduce, run that MapReduce job, and have all the tasks execute one by one?
To create the MapReduce job, I want to use the .NET SDK. So my question is: is this possible, and if yes, please suggest some steps and a reference link.
Thank you.
If you need to run those tasks periodically, I would recommend using Oozie. Check out the existing examples; it has fairly good documentation.
If you don't have this framework on your cloud, you can write your own MR, but if you have Oozie you can write a DAG flow where each action on the graph can be Pig/bash/Hive/HDFS and more.
It can run every X days/hours/minutes and can email you in case of failure (see the sketch below for submitting such a job).
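Submitting a scheduled (coordinator) job from the command line could look roughly like this (a sketch; the Oozie URL, properties file and job ID are hypothetical, and the coordinator/workflow XML defining the Pig/Hive/Sqoop actions is not shown):
# Submit and start a coordinator job whose definition lives in HDFS and is referenced from job.properties.
oozie job -oozie http://headnode:11000/oozie -config job.properties -run
# Check its status (and any failures) later.
oozie job -oozie http://headnode:11000/oozie -info 0000001-200101000000001-oozie-oozi-C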

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. This data is typically transactional data (like call records). I run a sequence of Hive queries that continuously apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to manually run one query after another (as some queries occasionally fail due to problems in AWS, etc.).
I have processed 2 months of data so far using this manual approach.
But for subsequent months, I want to be able to write a workflow that executes the queries one by one and, should a query fail, reruns it. This can't be done by running Hive queries from a bash .sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading, wondering whether it might be the solution to my problem; it does have Lingual, which might fit the case. I'm not sure, though, how it fits into the AWS ecosystem.
The best solution would be some kind of Hive query workflow process. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though I'm facing a sh!tload of issues setting it up on EMR. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
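From the command line, wiring this up could look roughly like this (a sketch; the pipeline name, ID and definition file are hypothetical, and the definition file would contain the HiveActivity objects with their retry settings):
# Create the pipeline, upload a JSON definition containing HiveActivity objects, and activate it.
aws datapipeline create-pipeline --name hive-etl --unique-id hive-etl-001
aws datapipeline put-pipeline-definition --pipeline-id df-EXAMPLE123 --pipeline-definition file://hive-etl-definition.json
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE123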