How to set up job dependencies in google bigquery? - google-bigquery

I have a few jobs: say, one loads a text file from a Google Cloud Storage bucket into a BigQuery table, and another is a scheduled query that copies data from one table to another with some transformation. I want the second job to depend on the success of the first one. How do we achieve this in BigQuery, if it is possible to do so at all?
Many thanks.
Best regards,

Right now a developer needs to put together the chain of operations themselves.
It can be done either using Cloud Functions (supports Node.js, Go, Python) or via a Cloud Run container (supports the gcloud API and any programming language).
Basically you need to:
issue a job
get the job id
poll for the job id
when the job is finished, trigger the next step
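A minimal sketch of those four steps with the google-cloud-bigquery Python client (all bucket, table and dataset names here are placeholders, not from the original answer):

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: issue the first job (a load from GCS; URI and table are hypothetical)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my_project.my_dataset.raw_table",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV),
)

# Steps 2 and 3: the job object carries the job id; result() polls until it finishes
print("waiting for job", load_job.job_id)
load_job.result()  # raises if the load fails

# Step 4: only reached if the load succeeded, so trigger the dependent transformation
client.query(
    "INSERT INTO my_project.my_dataset.clean_table "
    "SELECT * FROM my_project.my_dataset.raw_table"
).result()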
If using Cloud Functions:
place the file into a dedicated GCS bucket
set up a Cloud Function that monitors that bucket; when a new file is uploaded it runs, imports the file into BigQuery, and waits until the operation ends
at the end of that function you can trigger other functions for the next step
Another use case with Cloud Functions:
A: a trigger starts the Cloud Function
B: the function executes the query (copies data to another table)
C: it gets a job id and fires another function with a bit of delay
D: that function takes the job id
E: and polls: is the job ready?
F: if not ready, it fires itself again after a short delay
G: if ready, it triggers the next step, which could be a dedicated function or a parameterized one

It is possible to address your scenario with either Cloud Functions (CF) or with a scheduler (Airflow). The first approach is event-driven, so your data gets processed immediately; with a scheduler, expect some delay in data availability.
As already stated, once you submit a BigQuery job you get back a job ID that needs to be checked until the job completes. Based on the final status you can then run the success or failure post-actions respectively.
If you develop a Cloud Function, note that there are certain limitations, such as the maximum execution time (9 minutes), which you would have to work around if the BigQuery job takes longer than that to complete. Another challenge with Cloud Functions is idempotency: if the same data-file event arrives more than once, the processing must not produce duplicate data.
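One way to approach the idempotency part (a sketch, not from the original answer) is to derive the BigQuery job ID deterministically from the file name and generation, so a duplicate event for the same object resolves to the same job and the second submission is rejected instead of loading the data twice. The event fields below assume the standard GCS background-function trigger, and the table name is a placeholder.

import hashlib
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

def load_file(event, context):
    client = bigquery.Client()
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    # The same object and generation always map to the same job id
    job_id = "load-" + hashlib.sha1("{}#{}".format(uri, event["generation"]).encode()).hexdigest()
    try:
        client.load_table_from_uri(
            uri,
            "my_project.my_dataset.raw_table",
            job_id=job_id,
            job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV),
        ).result()
    except Conflict:
        # A job with this id already exists: the duplicate event is ignored safely
        pass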
Alternatively, you can consider using an event-driven, serverless open-source project like BqTail - a Google Cloud Storage BigQuery loader with post-load transformation.
Here is an example of the bqtail rule.
rule.yaml
When:
  Prefix: "/mypath/mysubpath"
  Suffix: ".json"
Async: true
Batch:
  Window:
    DurationInSec: 85
Dest:
  Table: bqtail.transactions
  Transient:
    Dataset: temp
    Alias: t
  Transform:
    charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
  SideInputs:
    - Table: bqtail.fees
      Alias: f
      'On': t.fee_id = f.id
OnSuccess:
  - Action: query
    Request:
      SQL: SELECT
        DATE(timestamp) AS date,
        sku_id,
        supply_entity_id,
        MAX($EventID) AS batch_id,
        SUM(payment) payment,
        SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
        SUM(COALESCE(qty, 1.0)) AS qty
        FROM $TempTable t
        LEFT JOIN bqtail.fees f ON f.id = t.fee_id
        GROUP BY 1, 2, 3
      Dest: bqtail.supply_performance
      Append: true
    OnFailure:
      - Action: notify
        Request:
          Channels:
            - "#e2e"
          Title: Failed to aggregate data to supply_performance
          Message: "$Error"
    OnSuccess:
      - Action: query
        Request:
          SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
          Dest: bqtail.supply_performance_batches
          Append: true
      - Action: delete

You want to use an orchestration tool, especially if you want to set up these tasks as recurring jobs.
We use Google Cloud Composer, a managed service based on Airflow, for workflow orchestration and it works great. It comes with automatic retries, monitoring, alerting, and much more.
You might want to give it a try.
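For the scenario in the question, a minimal DAG sketch could look like the following; the operators come from the Google provider package, and the bucket, file, tables and query are placeholders, not anything from the original answer.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG("load_then_transform", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    # First job: load the text file from GCS into a BigQuery table
    load = GCSToBigQueryOperator(
        task_id="load_file",
        bucket="my-bucket",
        source_objects=["data.csv"],
        destination_project_dataset_table="my_project.my_dataset.raw_table",
        write_disposition="WRITE_TRUNCATE",
    )

    # Second job: the transformation query, expressed as a BigQuery job configuration
    transform = BigQueryInsertJobOperator(
        task_id="transform",
        configuration={
            "query": {
                "query": "INSERT INTO my_dataset.clean_table SELECT * FROM my_dataset.raw_table",
                "useLegacySql": False,
            }
        },
    )

    # The second task only runs if the first one succeeded
    load >> transform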

Basically, you can use Cloud Logging to observe almost every kind of operation in GCP.
BigQuery is no exception. When a query job completes, you can find the corresponding log entry in the Logs Viewer.
The next question is how to pinpoint the exact query you care about; one way to achieve this is to use a labeled query (i.e. attach labels to your query) [1].
For example, you can use the bq command below to issue a query with the label foo:bar
bq query \
--nouse_legacy_sql \
--label foo:bar \
'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to the Logs Viewer and use the log filter below, you will find exactly the log entry generated by the query above.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log entry so it can trigger the next workload. This is where Cloud Pub/Sub comes into play.
Two ways to publish an event based on a log pattern are:
Log Routers: set a Pub/Sub topic as the destination [2]
Log-based Metrics: create an alert policy whose notification channel is Pub/Sub [3]
So, the next workload can subscribe to the Pub/Sub topic, and be triggered when the previous query has completed.
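For example, the subscriber could be a small Pub/Sub-triggered Cloud Function that double-checks the label in the forwarded log entry before starting the next job. This is only a sketch: the payload path follows the jobCompletedEvent structure used in the filter above, and the tables are placeholders.

import base64
import json
from google.cloud import bigquery

def on_query_completed(event, context):
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    labels = (entry.get("protoPayload", {})
                   .get("serviceData", {})
                   .get("jobCompletedEvent", {})
                   .get("job", {})
                   .get("jobConfiguration", {})
                   .get("labels", {}))
    if labels.get("foo") != "bar":
        return  # not the query this workload is waiting for

    # The labeled query has completed: start the dependent workload
    client = bigquery.Client()
    client.query(
        "INSERT INTO my_dataset.downstream SELECT * FROM my_dataset.upstream"
    ).result()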
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics

Related

Am I able to view a list of devices by partition in iothub?

I have 2 nodes of a cluster receiving messages from IoT Hub. I split their responsibility by partition: node 1 reads from partitions 1, 3, 5, 7 and 9, and the other from 2, 4, 6, 8 and 0. Recently, my partition 8 stops responding until I stop my code and restart it. It seems like a device is sending a message that locks up the partition. What I want to do is list all devices in my partition 8. Is that possible? Is there a cloud shell command to get those devices in a list?
Not sure this will help you, but you can see the partition on the incoming messages. For example you could use Azure Stream Analytics to see the partitions using this query:
Select GetMetadataPropertyValue(IoTHub, '[IoTHub].[ConnectionDeviceId]') as DeviceId, partitionId
from IoTHub
Also, if you run locally in Visual Studio it will tell you which device is sending malformed JSON, e.g.
[Warning] 10/21/2021 9:12:54 AM : User Warning Source 'IoTHub' had 1 occurrences of kind 'InputDeserializerError.InvalidData' between processing times '2021-10-21T15:12:50.5076449Z' and '2021-10-21T15:12:50.5712076Z'. Could not deserialize the input event(s) from resource 'Partition: [1], Offset: [455266583232], SequenceNumber: [634800], DeviceId: [DeviceName]' as Json. Some possible reasons: 1) Malformed events 2) Input source configured with incorrect serialization format
Also check your "Activity Log" blade in the ASA job. It may have more details for you.

Cloud Storage to BigQuery (upsert) via DataFlow

Whenever a file is written to Cloud Storage, I want it to trigger a Cloud Function that executes a DataFlow template to transform the file content and write the results to BigQuery.
I think I've got a handle on that much for the most part. But the problem is that I don't need to just insert into a BQ table, I need to upsert (using the MERGE operation). This seems like it would be a common requirement, but the Apache Beam BQ connector doesn't offer this option (only write, create and truncate/write dispositions).
So then I thought... OK, if I can just capture when the DataFlow pipeline is done executing, I could have DataFlow write to a temporary table and then I could call a SQL Merge query to merge data from the temp table to the target table. However, I'm not seeing any way to trigger a cloud function upon pipeline execution completion.
Any suggestions on how to accomplish the end goal?
Thanks
There is no built-in way to generate an event at the end of a Dataflow job. However, you can cheat thanks to the logs.
To do this:
Go to the logs, select the advanced filter (arrow on the right of the filter bar) and paste this custom filter:
resource.type="dataflow_step" textPayload="Worker pool stopped."
You should see only the end-of-job entries of your Dataflow jobs. Then you have to create a sink of this result into Pub/Sub, plug your function onto these Pub/Sub messages, and you can do what you want.
To set this up, after filling in your custom filter:
Click on Create sink
Set a sink name
Set the destination to Pub/Sub
Select your topic
Now plug a function onto this topic; it will be triggered only at the end of a Dataflow job.
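That function can then run the MERGE, which covers the upsert part of the question. A rough sketch, with placeholder table names and the Pub/Sub message used purely as a completion signal:

from google.cloud import bigquery

MERGE_SQL = """
MERGE my_dataset.target T
USING my_dataset.temp S
ON T.id = S.id
WHEN MATCHED THEN UPDATE SET T.value = S.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value)
"""

def on_dataflow_finished(event, context):
    # The message is the exported "Worker pool stopped." log entry; its content
    # is not inspected here, it only signals that the pipeline has completed.
    client = bigquery.Client()
    client.query(MERGE_SQL).result()  # raises if the MERGE fails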
I have implemented this exact use case, but instead of using 2 different pipelines, you can do it with just 1 pipeline.
Step 1: Read the file from GCS and convert it into TableRows.
Step 2: Read the existing rows from BigQuery.
Step 3: Read both sources as keyed collections, as in the code below (the two DoFns are placeholders for your own parsing and keying logic); the custom upsert logic itself lives in the ParDo of step 4.
PCollection<KV<String, TableRow>> existing = p.apply(BigQueryIO.readTableRows().from("my_project:my_dataset.target"))
    .apply(ParDo.of(new KeyByIdFn()));                   // emits KV<id, TableRow> for existing rows
PCollection<KV<String, TableRow>> incoming = p.apply(TextIO.read().from("gs://my-bucket/input.json"))
    .apply(ParDo.of(new ParseToKeyedTableRowFn()));      // parses each line and emits KV<id, TableRow>
Step 4: Perform a CoGroupByKey and apply a ParDo on the result to produce the updated rows (the equivalent of a MERGE operation).
Step 5: Write the complete set of TableRows back to BigQuery using WRITE_TRUNCATE mode.
The code is a little more involved, but it performs better as a single pipeline.
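For reference, roughly the same shape in the Beam Python SDK; this is only a sketch, assuming the target table already exists, has an id column, and that the GCS file contains one JSON object per line (none of which comes from the original answer).

import json
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery, BigQueryDisposition

def pick_latest(element):
    key, grouped = element
    incoming = list(grouped["incoming"])
    existing = list(grouped["existing"])
    # Prefer the incoming row when the key exists in both collections (the "upsert")
    return incoming[0] if incoming else existing[0]

with beam.Pipeline() as p:
    existing = (p | "ReadBQ" >> ReadFromBigQuery(table="my_project:my_dataset.target")
                  | "KeyExisting" >> beam.Map(lambda row: (row["id"], row)))
    incoming = (p | "ReadGCS" >> beam.io.ReadFromText("gs://my-bucket/input.json")
                  | "Parse" >> beam.Map(json.loads)
                  | "KeyIncoming" >> beam.Map(lambda row: (row["id"], row)))
    merged = ({"incoming": incoming, "existing": existing}
              | beam.CoGroupByKey()
              | beam.Map(pick_latest))
    merged | WriteToBigQuery(
        "my_project:my_dataset.target",
        write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=BigQueryDisposition.CREATE_NEVER,
    )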
Interesting question; there are some good ideas already, but I'd like to show another possibility with just Dataflow and BigQuery. If this is a non-templated batch job we can use PipelineResult.waitUntilFinish(), which:
Waits until the pipeline finishes and returns the final status.
Then we check if State is DONE and proceed with the MERGE statement if needed:
PipelineResult res = p.run();
res.waitUntilFinish();
if (res.getState() == PipelineResult.State.DONE) {
  LOG.info("Dataflow job is finished. Merging results...");
  MergeResults();
  LOG.info("All done :)");
}
In order to test this we can create a BigQuery table (upsert.full) which will contain the final results and be updated each run:
bq mk upsert
bq mk -t upsert.full name:STRING,total:INT64
bq query --use_legacy_sql=false "INSERT upsert.full (name, total) VALUES('tv', 10), ('laptop', 20)"
At the start we'll populate it with a total of 10 TVs. But now let's imagine that we sell 5 extra TVs and, in our Dataflow job, we write a single row to a temporary table (upsert.temp) with the new corrected value (15):
p
    .apply("Create Data", Create.of("Start"))
    .apply("Write", BigQueryIO
        .<String>write()
        .to(output)
        .withFormatFunction(
            (String dummy) ->
                new TableRow().set("name", "tv").set("total", 15))
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema));
So now we want to update the original table with the following query (DML syntax):
MERGE upsert.full F
USING upsert.temp T
ON T.name = F.name
WHEN MATCHED THEN
UPDATE SET total = T.total
WHEN NOT MATCHED THEN
INSERT(name, total)
VALUES(name, total)
Therefore, we can use BigQuery's Java Client Library in MergeResults:
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
    QueryJobConfiguration.newBuilder(
            "MERGE upsert.full F "
                + ...
                + "VALUES(name, total)")
        .setUseLegacySql(false)
        .build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
This is based on this snippet which includes some basic error handling. Note that you'll need to add this to your pom.xml or equivalent:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-bigquery</artifactId>
  <version>1.82.0</version>
</dependency>
and it works for me:
INFO: 2020-02-08T11:38:56.292Z: Worker pool stopped.
Feb 08, 2020 12:39:04 PM org.apache.beam.runners.dataflow.DataflowPipelineJob logTerminalState
INFO: Job 2020-02-08_REDACTED finished with status DONE.
Feb 08, 2020 12:39:04 PM org.apache.beam.examples.BigQueryUpsert main
INFO: Dataflow job is finished. Merging results...
Feb 08, 2020 12:39:09 PM org.apache.beam.examples.BigQueryUpsert main
INFO: All done :)
$ bq query --use_legacy_sql=false "SELECT name,total FROM upsert.full LIMIT 10"
+--------+-------+
| name | total |
+--------+-------+
| tv | 15 |
| laptop | 20 |
+--------+-------+
Tested with the 2.17.0 Java SDK and both the Direct and Dataflow runners.
Full example here

getting running job id from BigQueryOperator using xcom

I want to get BigQuery's job id from the BigQueryOperator.
I saw in bigquery_operator.py file the following line:
context['task_instance'].xcom_push(key='job_id', value=job_id)
I don't know if this is Airflow's job id or the BigQuery job id; if it's the BigQuery job id, how can I get it via XCom from a downstream task?
I tried to do the following in a downstream PythonOperator:
def write_statistics(**kwargs):
    job_id = kwargs['templates_dict']['job_id']
    print('tamir')
    print(kwargs['ti'].xcom_pull(task_ids='create_tmp_big_query_table', key='job_id'))
    print(kwargs['ti'])
    print(job_id)
t3 = BigQueryOperator(
    task_id='create_tmp_big_query_table',
    bigquery_conn_id='bigquery_default',
    destination_dataset_table=DATASET_TABLE_NAME,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    sql="""
    #standardSQL...
The UI is great for checking whether an XCom was written or not, and I'd recommend you do that even before you try to reference it in a separate task, so you don't need to worry about whether you're fetching it correctly. Click your create_tmp_big_query_table task -> Task Instance Details -> XCom to see what was pushed.
In your case the code looks right to me, but I'm guessing your version of Airflow doesn't have the change that added saving the job id into an XCom. This feature was added in https://github.com/apache/airflow/pull/5195, which is currently only on master and not yet part of the most recent stable release (1.10.3). See for yourself in the 1.10.3 version of the BigQueryOperator.
Your options are to wait for it to land in a release (which sometimes takes a while), run off a version of master with that change, or temporarily copy the newer version of the operator as a custom operator. In the last case, I'd suggest naming it something like BigQueryOperatorWithXcom, with a note to replace it with the built-in operator once it's released.
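A minimal sketch of such a custom operator, assuming Airflow 1.10.x where the operator keeps a bq_cursor after running the query (this is not the official implementation):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator


class BigQueryOperatorWithXcom(BigQueryOperator):
    def execute(self, context):
        super(BigQueryOperatorWithXcom, self).execute(context)
        # running_job_id is set on the cursor once run_query has been called
        job_id = self.bq_cursor.running_job_id
        context['task_instance'].xcom_push(key='job_id', value=job_id)
        return job_id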
The job ID within bigquery_operator.py is the BigQuery job ID. You can see this by looking at the preceding lines:
if isinstance(self.sql, str):
    job_id = self.bq_cursor.run_query(
        sql=self.sql,
        destination_dataset_table=self.destination_dataset_table,
        write_disposition=self.write_disposition,
        allow_large_results=self.allow_large_results,
        flatten_results=self.flatten_results,
        udf_config=self.udf_config,
        maximum_billing_tier=self.maximum_billing_tier,
        maximum_bytes_billed=self.maximum_bytes_billed,
        create_disposition=self.create_disposition,
        query_params=self.query_params,
        labels=self.labels,
        schema_update_options=self.schema_update_options,
        priority=self.priority,
        time_partitioning=self.time_partitioning,
        api_resource_configs=self.api_resource_configs,
        cluster_fields=self.cluster_fields,
        encryption_configuration=self.encryption_configuration
    )
elif isinstance(self.sql, Iterable):
    job_id = [
        self.bq_cursor.run_query(
            sql=s,
            destination_dataset_table=self.destination_dataset_table,
            write_disposition=self.write_disposition,
            allow_large_results=self.allow_large_results,
            flatten_results=self.flatten_results,
            udf_config=self.udf_config,
            maximum_billing_tier=self.maximum_billing_tier,
            maximum_bytes_billed=self.maximum_bytes_billed,
            create_disposition=self.create_disposition,
            query_params=self.query_params,
            labels=self.labels,
            schema_update_options=self.schema_update_options,
            priority=self.priority,
            time_partitioning=self.time_partitioning,
            api_resource_configs=self.api_resource_configs,
            cluster_fields=self.cluster_fields,
            encryption_configuration=self.encryption_configuration
        )
        for s in self.sql]
Eventually, the run_with_configuration method returns self.running_job_id from BigQuery.
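To consume it downstream, the PythonOperator from the question could be wired up roughly like this; the task ids match the snippet above, everything else is illustrative.

from airflow.operators.python_operator import PythonOperator

def write_statistics(**kwargs):
    ti = kwargs['ti']
    job_id = ti.xcom_pull(task_ids='create_tmp_big_query_table', key='job_id')
    print('BigQuery job id:', job_id)

t4 = PythonOperator(
    task_id='write_statistics',
    python_callable=write_statistics,
    provide_context=True,  # needed on Airflow 1.10.x so kwargs like 'ti' are passed
    dag=dag,               # dag is your existing DAG object
)

t3 >> t4  # make sure the query task runs first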

Change Google Cloud Dataflow BigQuery Priority

I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job it takes minutes for the job to start reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);
  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH")  // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");
  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:
Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
Write the results of step 1 to a temp table
In your Beam pipeline, read the temp table using BigQueryIO.Read.from()
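A sketch of steps 1 and 2 with the google-cloud-bigquery Python client; the temp table and query are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.INTERACTIVE,        # run immediately instead of queueing
    destination="my_project.my_dataset.temp_results",   # hypothetical temp table for the pipeline to read
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query("SELECT * FROM my_dataset.source_table", job_config=job_config).result()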
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
Please note that you might be hitting one of the BigQuery limits and quotas: with batch priority, if you ever hit a rate limit the query is queued and retried later, whereas interactive queries fail immediately when these limits are hit. This is because BigQuery assumes that an interactive query is something you need to run immediately.

Salt Stack - Best way to get schedule output

I'm using SaltStack on my servers. I set up a simple schedule:
job1:
  schedule.present:
    - function: state.apply
    - seconds: 1800
    - splay: 5
Now I want to get the output of the scheduled job back on my master (or on my minion, I'd just like to know the best way).
I don't really know how to use the Salt mine or returners, or which is best suited to my needs.
Thank you :)
Job data return and job metadata
By default, data about job runs from the Salt scheduler is returned to the master.
It can therefore be useful to include specific data to differentiate a job from other jobs. Using the metadata parameter, a value can be associated with a scheduled job; although these metadata values aren't used in the execution of the job, they can be used to search for a specific job later.
job1:
  schedule.present:
    - function: state.apply
    - seconds: 1800
    - splay: 5
    - metadata:
        foo: bar
An example job search on the salt-master might look similar to this:
salt-run jobs.list_jobs search_metadata='{"foo": "bar"}'
Take a look at the salt.runners.jobs documentation for more examples.
Scheduler with returner
The schedule.present function includes a parameter to set a returner that will be used to return the results of the scheduled job. You could, for example, use the pushover returner:
job1:
  schedule.present:
    - function: state.apply
    - seconds: 1800
    - splay: 5
    - returner: pushover
Take a look at the returners documentation for a full list of builtin returners and usage information.