Bigquery data transfer trigger on-demand - google-bigquery

I have implemented a transfer job from S3 to BQ and configured it to be on-demand. I have used this guide. How can I trigger a data transfer configured to be on-demand via either the console or a CLI / API calls?

To run your job using the Console, click on your job, then click the option 'More' on the upper right corner and click the option 'Refresh Transfer'
Also, it seems like there is currently some improper behavior with the new BigQuery UI, as well as the transfer jobs. Most likely, you will see in the Console that your transfer job, even if configured as 'On-Demand', shows in transfer details as scheduled to run every 24h, but running this command (1), you will see how the job schedule option 'disableAutoScheduling' is set to true. Bear in mind this product is still in Beta, so there is still work in progress.
(1): bq show --format=prettyjson --transfer_config [resource_name]

Related

Running BQ Load command using Google cloud scheduler

I would like to run bq load command once every day at 00:00 UTC. Can I use Google Cloud scheduler for scheduling this command?
As mentioned by #Daniel, there is no direct way to use cloud scheduler to execute-schedule queries, however there are options you can consider to run queries on schedule.
Use scheduled queries directly in BQ
Since your source is GCS, you can load data from GCS to BQ and then execute scheduled queries like mentioned here
Use scheduled Cloud Function to run queries
Schedule using Data Transfer
You can also try what #Graham Polley has mentioned in this blog, which requires an architecture combining Cloud Scheduler, Cloud Sourse Repositories and Cloud Build
Assuming you have a file that is being loaded into Cloud Storage everyday before 7am, you may consider a more resilient design: when the file is created in CS create a notification that starts the process to load it. It is a better design that will get the information earlier into BigQuery and it will keep working even if the file creation is delayed.
When the file is created in Cloud Storage get a message in PubSub: https://cloud.google.com/storage/docs/pubsub-notifications
Then, a Cloud Function is invoked that will execute the bq load command.
BTW if you have many files or even some dependencies, consider using Cloud Composer as an orchestrator to keep its complexity under control.
You would not be able to do it directly with Cloud Scheduler you would need an intermediary like a Cloud Function to execute a command. Alternatively you could try scheduling a data transfer, depending on the requirements of your load job.
Here is an example from the documentation:
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
Based on your update of desiring to shard the table based on date, try scheduled queries in the following manner.
Create an external table pointed to the desired path in GCS as described here
Define your query, i recommend defining a query with column names and appropriate casting.
SELECT *
FROM myproject.dataset_id.external_table_name
-- INCLUDE FILTERING ON _FILE_NAME IF NEEDED LIKE FOLLOWING:
-- WHERE _FILE_NAME LIKE SOME_VALUE
Create Schedule Query with Run_Date Parmeter in the table name like new_table_{run_date}

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely from any machine I use Pyhive or PyODBC to connect to Hive and do things. It's not just one query. It can be a long session with intermittent queries. (The query itself has issues; will ask separately.)
Even during one single, active query, the operation does not show as a "Job" (I guess it's Yarn) on the dashboard. In contrast, when I "submit" tasks via Pyspark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, w/o a Job, the cluster may not reliably detect a Python client is "connected" to it, hence the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to companion my Python session, and cancel/delete the job at times of my choosing? For my case, it is a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let Yarn detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now, you need to submit jobs via Dataproc Jobs API to make them visible on jobs UI page and to be taken into account by cluster TTL feature.
If you can not use Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for desired time (5 hours in the example below) to prevent cluster deletion by max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"

How to disable Scheduled queries in Big Query Console

I have close to 1000 queries scheduled in the big query cloud platform. However i would now have to pause all of them. One approach is to manually disable one after the other, but this would be an exhaustive task. Is there any way to disable all the active queries in one go?
Disabling scheduled queries (transfer) is not supported by the CLI, but it can be done using Console/API:
Document Link:
https://cloud.google.com/bigquery-transfer/docs/working-with-transfers
Disabling a transfer using API document:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/patch
List transfer config names:
bq ls --transfer_config --transfer_location=US --format=prettyjson | jq '.[].name'

Loading data to BigQuery using python API vs bq load

Got a new requirement. In GCS bucket have around 130+ files and these files need to be loaded into different tables on BigQuery on daily basis.
After researching, I found two options.
1) Use "bq load" command to load (Shell Script/Python Script)
2) Create a Python API to load the data to BigQuery
Which option is best. If I go with Python API, I need use APPENGINE to schedule it.
is there any better option other than this?
Thanks,
However you do it, you'll be creating load jobs. So from the BigQuery side of things, it doesn't really matter which option you choose.
As far as scheduling goes, you do have some options on Google Cloud Platform:
App Engine standard environment cron service.
See this example for using this to reliably schedule tasks via Pub/Sub.
Your operating system's cron or systemd timers on a Compute Engine instance.
A cron job on a Kubernetes cluster, using Container Engine.
There are a few differences:
a) BQ Load:
-You can have some issues using special chars as delimiters, like ^ and |.
-You don't need a service account (You can use a user account)
-You can't use it on google cloud functions
b) API
-You don't have the special chars trouble.
-You can use it on google cloud functions
-And if you create a python script, you can schedule it on Scheduled Tasks (On Windows)

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but now it's possible to use a micro-batching tactic using a Stream Pipeline. If you decide one way or another you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERT, if you use the first one you need to include the withTriggeringFrequency and withNumFileShards. For the first one, from my experience, is better to use minutes and the number will depend on the amount of throughput data. If you receive quite a lot try to keep it small, I have seen "stuck errors" when you increase it too much. The shards can affect mostly your GCS billing, if you add to much shards it will create more files per table per x amount of minutes.
If your input data size is not so big the streaming insert can work really well and the cost shouldn't be a big deal. In that scenario you can use STREAMING_INSERT method and remove the withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy like InsertRetryPolicy.retryTransientErrors() so no rows are being lost (keep in mind that idempotency is not guaranteed with STREAM_INSERTS, so duplication is possible)
You can check your Jobs in BigQuery and validate that everything is working! Keep in mind the policies for jobs with BigQuery (I think is 1000 jobs per table) when you are trying to define triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).