I want to get the BigQuery job ID from BigQueryOperator.
I saw the following line in bigquery_operator.py:
context['task_instance'].xcom_push(key='job_id', value=job_id)
I don't know whether this is Airflow's job ID or the BigQuery job ID. If it is the BigQuery job ID, how can I get it through XCom from a downstream task?
I tried the following in a downstream PythonOperator:
def write_statistics(**kwargs):
    job_id = kwargs['templates_dict']['job_id']
    print('tamir')
    print(kwargs['ti'].xcom_pull(task_ids='create_tmp_big_query_table', key='job_id'))
    print(kwargs['ti'])
    print(job_id)

t3 = BigQueryOperator(
    task_id='create_tmp_big_query_table',
    bigquery_conn_id='bigquery_default',
    destination_dataset_table=DATASET_TABLE_NAME,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    sql="""
    #standardSQL...
The UI is great for checking whether an XCom was written at all, and I'd recommend doing that before you try to reference it in a separate task, so you can rule out a problem with how you fetch it. Click your create_tmp_big_query_table task -> Task Instance Details -> XCom. The XCom tab lists the keys and values that the task pushed.
In your case, the code looks right to me, but I'm guessing your version of Airflow doesn't include the change that saves the job ID into an XCom. This feature was added in https://github.com/apache/airflow/pull/5195, which is currently only on master and not yet part of the most recent stable release (1.10.3). You can confirm this by reading the 1.10.3 version of the BigQueryOperator.
Your options are to wait for the change to land in a release (which sometimes takes a while), to run off a version of master that includes it, or to temporarily copy the newer version of the operator into your project as a custom operator. In the last case, I'd suggest naming it something like BigQueryOperatorWithXcom, with a note to replace it with the built-in operator once that is released.
The job ID in bigquery_operator.py is the BigQuery job ID. You can see this from the lines that precede the xcom_push call:
if isinstance(self.sql, str):
    job_id = self.bq_cursor.run_query(
        sql=self.sql,
        destination_dataset_table=self.destination_dataset_table,
        write_disposition=self.write_disposition,
        allow_large_results=self.allow_large_results,
        flatten_results=self.flatten_results,
        udf_config=self.udf_config,
        maximum_billing_tier=self.maximum_billing_tier,
        maximum_bytes_billed=self.maximum_bytes_billed,
        create_disposition=self.create_disposition,
        query_params=self.query_params,
        labels=self.labels,
        schema_update_options=self.schema_update_options,
        priority=self.priority,
        time_partitioning=self.time_partitioning,
        api_resource_configs=self.api_resource_configs,
        cluster_fields=self.cluster_fields,
        encryption_configuration=self.encryption_configuration
    )
elif isinstance(self.sql, Iterable):
    job_id = [
        self.bq_cursor.run_query(
            sql=s,
            destination_dataset_table=self.destination_dataset_table,
            write_disposition=self.write_disposition,
            allow_large_results=self.allow_large_results,
            flatten_results=self.flatten_results,
            udf_config=self.udf_config,
            maximum_billing_tier=self.maximum_billing_tier,
            maximum_bytes_billed=self.maximum_bytes_billed,
            create_disposition=self.create_disposition,
            query_params=self.query_params,
            labels=self.labels,
            schema_update_options=self.schema_update_options,
            priority=self.priority,
            time_partitioning=self.time_partitioning,
            api_resource_configs=self.api_resource_configs,
            cluster_fields=self.cluster_fields,
            encryption_configuration=self.encryption_configuration
        )
        for s in self.sql]
Ultimately, run_query ends in the run_with_configuration method, which returns self.running_job_id, the job ID that comes back from BigQuery.
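Since the operator code above assigns either a single string or a list to job_id, depending on whether sql is one statement or several, a downstream task can receive either shape from XCom. Here is a minimal sketch of handling both. The names normalize_job_ids and FakeTaskInstance are hypothetical and exist only for illustration (the stub stands in for Airflow's task instance so the snippet runs on its own), and it assumes an Airflow version whose operator actually pushes the job_id key.

```python
def normalize_job_ids(xcom_value):
    """Return the pulled XCom value as a list of BigQuery job IDs."""
    if xcom_value is None:
        # Nothing was pushed, e.g. the operator version predates the XCom change.
        return []
    if isinstance(xcom_value, str):
        # A single SQL statement yields a single job ID string.
        return [xcom_value]
    # Several SQL statements yield one job ID per statement.
    return list(xcom_value)


def write_statistics(**kwargs):
    """Downstream callable: pull the job ID(s) pushed by the BigQuery task."""
    pulled = kwargs['ti'].xcom_pull(
        task_ids='create_tmp_big_query_table', key='job_id')
    job_ids = normalize_job_ids(pulled)
    for job_id in job_ids:
        print('BigQuery job id:', job_id)
    return job_ids


class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's task instance, for local runs only."""

    def __init__(self, value):
        self._value = value

    def xcom_pull(self, task_ids=None, key=None):
        return self._value
```

Calling write_statistics(ti=FakeTaskInstance('some_job')) exercises the function without a running scheduler; inside a DAG, Airflow supplies ti itself.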
My goal is to load a cache when there is new data available. Data is loaded into the source table once a day but at an unpredictable time.
I've been trying to set up a data-availability trigger as a VDP scheduler job, as described in this Denodo community post:
https://community.denodo.com/answers/question/details?questionId=9060g0000004FOtAAM&title=Run+Scheduler+Job+Based+on+Value+from+a+Query
The post describes creating a scheduler job that fails whenever the condition is not satisfied. The only way I've found to force an error under certain conditions is a division by zero such as (1/0), and for some reason this doesn't always work. I was wondering whether there is a way to do this with a function, as in normal SQL; I couldn't find anything in the Denodo documentation.
This is what my code currently looks like:
-- Trigger job
SELECT CASE
    WHEN (
        data_in_cache = current_data
    )
    THEN 1 % 0
    ELSE 1
END
FROM database.table;
The cache job waits for the trigger job to succeed, so the cache is only loaded when the data in it is outdated. This doesn't always work, even though I feel it should.
I'm hoping someone has a function or a line of VQL that makes a Denodo scheduler VDP job end in an error.
This is easy to do with a custom function that simply throws an exception when executed. It doesn't have to be a plain Exception; you could define your own exception class so it is easy to spot in the error trace. In any case, it could look something like this...
@CustomElement(type = CustomElementType.VDPFUNCTION, name = "ERROR_SAMPLE_FUNCTION")
public class ErrorSampleVdpFunction {
    @CustomExecutor
    public CustomArrayValue errorSampleFunction() throws Exception {
        throw new Exception("This is an error");
    }
}
So you would use it like this:
-- Trigger job
SELECT CASE
    WHEN (
        data_in_cache = current_data
    )
    THEN errorSampleFunction()
    ELSE 1
END
FROM database.table;
I have a few jobs. Say one loads a text file from a Google Cloud Storage bucket into a BigQuery table, and another is a scheduled query that copies data from one table to another with some transformation. I want the second job to depend on the success of the first one. How can this be achieved in BigQuery, if it is possible at all?
Many thanks.
Best regards,
Right now, a developer has to put the chain of operations together by hand.
This can be done either with Cloud Functions (supports Node.js, Go, Python) or with a Cloud Run container (supports the gcloud API and any programming language).
Basically you need to:
issue a job
get the job ID
poll for the job ID
when the job is finished, trigger the next steps
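The four steps above can be sketched as a small polling loop. This is only an outline under stated assumptions: issue_job, get_state and next_step are hypothetical callables that you would back with your BigQuery client of choice, and the sketch assumes get_state returns the job state as a string such as 'PENDING', 'RUNNING' or 'DONE'.

```python
import time


def wait_for_job(get_state, job_id, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep):
    """Poll get_state(job_id) until it reports 'DONE' or the timeout elapses.

    Returns True if the job reached 'DONE', False on timeout.
    Note: a job that is 'DONE' may still have failed, so the caller should
    inspect the job's error result before continuing.
    """
    waited = 0.0
    while True:
        if get_state(job_id) == 'DONE':
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


def run_chain(issue_job, get_state, next_step, **wait_kwargs):
    """Issue a job, wait for it to finish, then trigger the next step."""
    job_id = issue_job()                               # steps 1 and 2
    if not wait_for_job(get_state, job_id, **wait_kwargs):  # step 3
        raise TimeoutError('job {} did not finish in time'.format(job_id))
    return next_step(job_id)                           # step 4
```

The sleep function is injectable so the loop can be exercised without real delays; in a Cloud Function you would keep the timeout well below the function's own execution limit.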
If using Cloud Functions:
place the file into a dedicated GCS bucket
set up a GCF that monitors that bucket; when a new file is uploaded, it executes a function that imports the file into BigQuery and waits until the operation ends
at the end of the GCF you can trigger other functions for the next step
Another use case with Cloud Functions:
A: a trigger starts the GCF
B: the function executes the query (copy data to another table)
C: it gets a job ID and fires another function with a bit of delay
I: a function receives the job ID
J: it polls whether the job is ready
K: if not ready, it fires itself again with a bit of delay
L: if ready, it triggers the next step, which could be a dedicated function or a parameterized function
It is possible to address your scenario with either Cloud Functions (CF) or a scheduler (Airflow). The first approach is event-driven, so your data gets crunched immediately. With a scheduler, expect some delay before the data is available.
As has been stated, once you submit a BigQuery job you get back a job ID, which needs to be checked until the job completes. Then, based on the status, you can run the success or failure post-actions respectively.
If you were to develop a CF, note that there are certain limitations such as execution time (max 9 minutes), which you would have to work around if the BigQuery job takes longer than that to complete. Another challenge with CF is idempotency: if the same data-file event arrives more than once, the processing must not result in duplicate data.
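One common way to approach the idempotency concern is to derive the load job's ID deterministically from the file event, so that a redelivered event maps to the same job ID instead of a fresh one. The sketch below is an illustration, not a complete solution: load_job_id_for is a hypothetical helper, it assumes the event dictionary carries bucket, name and generation fields as Cloud Storage object events do, and it relies on BigQuery refusing a second job submitted under an ID that already exists in the project.

```python
import hashlib


def load_job_id_for(event, prefix='gcs_load'):
    """Derive a stable job ID from a Cloud Storage object event.

    The same bucket/name/generation always yields the same ID, so a
    duplicate delivery of the event cannot start a second, distinct job.
    """
    identity = '{}/{}#{}'.format(
        event['bucket'], event['name'], event.get('generation', ''))
    # Hash the identity so the ID only contains characters that are safe
    # in a job ID (letters, digits, underscore), whatever the file name is.
    digest = hashlib.sha256(identity.encode('utf-8')).hexdigest()[:32]
    return '{}_{}'.format(prefix, digest)
```

Including the generation means that a genuinely new upload of the same file name gets a new job ID, while a retried event for the same upload does not.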
Alternatively, you can consider using some event-driven serverless open source projects like BqTail - Google Cloud Storage BigQuery Loader with post-load transformation.
Here is an example of the bqtail rule.
rule.yaml
When:
  Prefix: "/mypath/mysubpath"
  Suffix: ".json"
Async: true
Batch:
  Window:
    DurationInSec: 85
Dest:
  Table: bqtail.transactions
  Transient:
    Dataset: temp
    Alias: t
  Transform:
    charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
  SideInputs:
    - Table: bqtail.fees
      Alias: f
      'On': t.fee_id = f.id
OnSuccess:
  - Action: query
    Request:
      SQL: SELECT
        DATE(timestamp) AS date,
        sku_id,
        supply_entity_id,
        MAX($EventID) AS batch_id,
        SUM( payment) payment,
        SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
        SUM(COALESCE(qty, 1.0)) AS qty
        FROM $TempTable t
        LEFT JOIN bqtail.fees f ON f.id = t.fee_id
        GROUP BY 1, 2, 3
      Dest: bqtail.supply_performance
      Append: true
    OnFailure:
      - Action: notify
        Request:
          Channels:
            - "#e2e"
          Title: Failed to aggregate data to supply_performance
          Message: "$Error"
    OnSuccess:
      - Action: query
        Request:
          SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
          Dest: bqtail.supply_performance_batches
          Append: true
  - Action: delete
You want to use an orchestration tool, especially if you want to set up these tasks as recurring jobs.
We use Google Cloud Composer, a managed service based on Airflow, for workflow orchestration, and it works great. It comes with automatic retries, monitoring, alerting, and much more.
You might want to give it a try.
Basically, you can use Cloud Logging to learn about almost any kind of operation in GCP.
BigQuery is no exception. When a query job completes, you can find the corresponding log entry in the Logs Viewer.
The next question is how to pin down the exact query you care about. One way to achieve this is to use a labeled query, that is, to attach labels to your query [1].
For example, you can use the bq command below to issue a query with the label foo:bar:
bq query \
    --nouse_legacy_sql \
    --label foo:bar \
    'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to the Logs Viewer and apply the log filter below, you will find the exact log entry generated by the query above.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log entry for the next workload. This is where Cloud Pub/Sub comes into play.
There are two ways to publish an event based on a log pattern:
Log Routers: set a Pub/Sub topic as the sink destination [2]
Log-based Metrics: create an alert policy whose notification channel is Pub/Sub [3]
So, the next workload can subscribe to the Pub/Sub topic, and be triggered when the previous query has completed.
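As a rough sketch of the subscribing side, a Pub/Sub-triggered function could decode the routed log entry and check the label before starting the next workload. This assumes the sink delivers the log entry as base64-encoded JSON in the message's data field and that the entry nests the labels under the same path as the log filter above; on_query_completed, labels_from_pubsub_event and the start_next callback are hypothetical names.

```python
import base64
import json


def labels_from_pubsub_event(event):
    """Decode a Pub/Sub-delivered log entry and return the query job's labels."""
    entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    # Same path as the log filter:
    # protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels
    job = (entry.get('protoPayload', {})
                .get('serviceData', {})
                .get('jobCompletedEvent', {})
                .get('job', {}))
    return job.get('jobConfiguration', {}).get('labels', {})


def on_query_completed(event, context=None, start_next=print):
    """Entry point: start the next workload only for the foo:bar query."""
    labels = labels_from_pubsub_event(event)
    if labels.get('foo') == 'bar':
        start_next('labeled query finished; starting next workload')
        return True
    return False
```

The label check is repeated in code so the function stays safe even if the topic later receives entries from a broader filter.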
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics
Whenever a file is written to Cloud Storage, I want it to trigger a Cloud Function that executes a Dataflow template to transform the file content and write the results to BigQuery.
I think I have a handle on that much, for the most part. The problem is that I don't just need to insert into a BQ table; I need to upsert (using the MERGE operation). This seems like it would be a common requirement, but the Apache Beam BQ connector doesn't offer this option (only write, create and truncate/write).
So then I thought... OK, if I can capture the moment the Dataflow pipeline finishes executing, I could have Dataflow write to a temporary table and then run a SQL MERGE query to merge data from the temp table into the target table. However, I'm not seeing any way to trigger a Cloud Function when the pipeline completes.
Any suggestions on how to accomplish the end goal?
Thanks
There is no native, built-in way to generate an event at the end of a Dataflow job. However, you can cheat by using the logs.
To do this:
Go to the logs, select the advanced filter (arrow on the right of the filter bar) and paste this custom filter:
resource.type="dataflow_step" textPayload="Worker pool stopped."
You should now see only the entries that mark the end of your Dataflow jobs. Next, create a sink that sends these results to Pub/Sub, then attach your function to those Pub/Sub messages and do whatever you want from there.
To do this, after filling in your custom filter:
Click on Create sink
Set a sink name
Set the destination to Pub/Sub
Select your topic
Now attach a function to this topic; it will be triggered only at the end of a Dataflow job.
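A function attached to that topic might look roughly like the sketch below. It is written under assumptions: the sink is taken to deliver each log entry as base64-encoded JSON in the message's data field, the job ID is assumed to be available under resource.labels.job_id of the dataflow_step entry, and on_dataflow_finished and run_merge are hypothetical names for your own entry point and follow-up step.

```python
import base64
import json

END_OF_JOB_TEXT = 'Worker pool stopped.'


def finished_dataflow_job_id(event):
    """Return the Dataflow job ID if the message marks the end of a job."""
    entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    if entry.get('textPayload') != END_OF_JOB_TEXT:
        # Not the end-of-job message; ignore it.
        return None
    # Assumption: the monitored resource labels carry the job ID.
    return entry.get('resource', {}).get('labels', {}).get('job_id')


def on_dataflow_finished(event, context=None, run_merge=print):
    """Entry point: run the follow-up step once per finished job."""
    job_id = finished_dataflow_job_id(event)
    if job_id is None:
        return False
    run_merge(job_id)
    return True
```

Passing the follow-up step in as run_merge keeps the decoding logic easy to exercise locally with a hand-built message.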
I have implemented this exact use case. Instead of using two separate pipelines, you can build a single one.
Step 1: Read the file from GCS and convert it into TableRows.
Step 2: Read all existing rows from BigQuery.
Step 3: Create one ParDo that holds your custom upsert logic, working from inputs like the code below.
PCollection<KV<String,TableRow>> val = p.apply(BigQueryIO.readTableRows().from(""));
PCollection<KV<String,TableRow>> val1 = p.apply(TextIO.read().from("")).apply(Convert to TableRow()));
Step 4: Apply CoGroupByKey and run a ParDo over the result to produce the updated rows (the equivalent of a MERGE operation).
Step 5: Write the complete set of TableRows to BigQuery using WRITE_TRUNCATE mode.
The code is a little more complicated this way, but a single pipeline should perform better.
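To make the merge logic of step 4 concrete, here is a plain-Python sketch of what the CoGroupByKey plus ParDo combination computes. It does not use Beam; co_group_by_key and upsert are hypothetical helpers that mimic the grouping on ordinary lists of (key, row) pairs, and the sample rows are invented for illustration.

```python
from collections import defaultdict


def co_group_by_key(existing, incoming):
    """Group two keyed collections by key, like a two-input CoGroupByKey."""
    grouped = defaultdict(lambda: ([], []))
    for key, row in existing:
        grouped[key][0].append(row)
    for key, row in incoming:
        grouped[key][1].append(row)
    return dict(grouped)


def upsert(existing, incoming):
    """Keep the newest incoming row per key, else the existing rows."""
    merged = []
    for key, (old_rows, new_rows) in co_group_by_key(existing, incoming).items():
        if new_rows:
            # Matched or brand-new key: the incoming row wins.
            merged.append(new_rows[-1])
        else:
            # Key only present in the table: keep it unchanged.
            merged.extend(old_rows)
    return merged


table = [('tv', {'name': 'tv', 'total': 10}),
         ('laptop', {'name': 'laptop', 'total': 20})]
new_file = [('tv', {'name': 'tv', 'total': 15}),
            ('phone', {'name': 'phone', 'total': 5})]
final_rows = upsert(table, new_file)
```

Because the output contains every row that should remain, untouched ones included, it can replace the whole table, which is why step 5 uses WRITE_TRUNCATE.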
Interesting question, some good ideas already but I'd like to show another possibility with just Dataflow and BigQuery. If this is a non-templated Batch job we can use PipelineResult.waitUntilFinish() which:
Waits until the pipeline finishes and returns the final status.
Then we check if State is DONE and proceed with the MERGE statement if needed:
PipelineResult res = p.run();
res.waitUntilFinish();
if (res.getState() == PipelineResult.State.DONE) {
    LOG.info("Dataflow job is finished. Merging results...");
    MergeResults();
    LOG.info("All done :)");
}
In order to test this we can create a BigQuery table (upsert.full) which will contain the final results and be updated each run:
bq mk upsert
bq mk -t upsert.full name:STRING,total:INT64
bq query --use_legacy_sql=false "INSERT upsert.full (name, total) VALUES('tv', 10), ('laptop', 20)"
At the start, we populate it with a total of 10 TVs. Now imagine that we sell 5 extra TVs and, in our Dataflow job, we write a single row to a temporary table (upsert.temp) with the new corrected value (15):
p
    .apply("Create Data", Create.of("Start"))
    .apply("Write", BigQueryIO
        .<String>write()
        .to(output)
        .withFormatFunction(
            (String dummy) ->
                new TableRow().set("name", "tv").set("total", 15))
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema));
So now we want to update the original table with the following query (DML syntax):
MERGE upsert.full F
USING upsert.temp T
ON T.name = F.name
WHEN MATCHED THEN
    UPDATE SET total = T.total
WHEN NOT MATCHED THEN
    INSERT(name, total)
    VALUES(name, total)
Therefore, we can use BigQuery's Java Client Library in MergeResults:
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
    QueryJobConfiguration.newBuilder(
        "MERGE upsert.full F "
        + ...
        + "VALUES(name, total)")
    .setUseLegacySql(false)
    .build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
This is based on this snippet which includes some basic error handling. Note that you'll need to add this to your pom.xml or equivalent:
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-bigquery</artifactId>
    <version>1.82.0</version>
</dependency>
and it works for me:
INFO: 2020-02-08T11:38:56.292Z: Worker pool stopped.
Feb 08, 2020 12:39:04 PM org.apache.beam.runners.dataflow.DataflowPipelineJob logTerminalState
INFO: Job 2020-02-08_REDACTED finished with status DONE.
Feb 08, 2020 12:39:04 PM org.apache.beam.examples.BigQueryUpsert main
INFO: Dataflow job is finished. Merging results...
Feb 08, 2020 12:39:09 PM org.apache.beam.examples.BigQueryUpsert main
INFO: All done :)
$ bq query --use_legacy_sql=false "SELECT name,total FROM upsert.full LIMIT 10"
+--------+-------+
| name | total |
+--------+-------+
| tv | 15 |
| laptop | 20 |
+--------+-------+
Tested with the 2.17.0 Java SDK and both the Direct and Dataflow runners.
Full example here
I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job, it takes minutes before it starts reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job that runs in BATCH mode rather than INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);
  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH") // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");
  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:
Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
Write the results of step 1 to a temp table
In your Beam pipeline, read the temp table using BigQueryIO.Read.from()
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
Note, however, that you might then run into BigQuery limits and quotas. With batch priority, a query that hits a rate limit is queued and retried later, whereas an interactive query that hits the same limit fails immediately. This is because BigQuery assumes that an interactive query is something you need to run immediately.
How can I get the job information listed below using the Jenkins API or some other command-line option?
Timestamp of the last job that succeeded.
Timestamp of the last job that failed.
I looked into this API, but it only gives build info and not the timestamp, i.e. the date and time at which the build failed or succeeded.
http://javadoc.jenkins-ci.org/hudson/model/Job.html
You do this by calling the methods getLastSuccessfulBuild() and getLastFailedBuild() on Job and then asking each returned build for its timestamp. In other words, there is no single method for this directly on the Job node; you need to combine several calls.
So, using the XML API for example, it would look something like this:
https://<JENKINS_URL>/job/<JOB_NAME>/api/xml?tree=lastSuccessfulBuild[timestamp],lastFailedBuild[timestamp]
In my case this gives me:
<freeStyleProject _class="hudson.model.FreeStyleProject">
    <lastFailedBuild _class="hudson.model.FreeStyleBuild">
        <timestamp>1484291786712</timestamp>
    </lastFailedBuild>
    <lastSuccessfulBuild _class="hudson.model.FreeStyleBuild">
        <timestamp>1486285440897</timestamp>
    </lastSuccessfulBuild>
</freeStyleProject>
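The timestamp values are milliseconds since the Unix epoch, so they still need converting to get a readable date and time. A small sketch of parsing a response like the one above follows; build_times is a hypothetical helper name, and fetching the XML over HTTP (with whatever authentication your Jenkins requires) is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE_XML = """<freeStyleProject _class="hudson.model.FreeStyleProject">
    <lastFailedBuild _class="hudson.model.FreeStyleBuild">
        <timestamp>1484291786712</timestamp>
    </lastFailedBuild>
    <lastSuccessfulBuild _class="hudson.model.FreeStyleBuild">
        <timestamp>1486285440897</timestamp>
    </lastSuccessfulBuild>
</freeStyleProject>"""


def build_times(api_xml):
    """Map each build element to its start time as a UTC datetime."""
    root = ET.fromstring(api_xml)
    times = {}
    for tag in ('lastSuccessfulBuild', 'lastFailedBuild'):
        node = root.find('{}/timestamp'.format(tag))
        # The element is absent if the job has no build of that kind yet.
        if node is not None and node.text:
            millis = int(node.text)
            times[tag] = datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)
    return times


for name, when in build_times(SAMPLE_XML).items():
    print(name, when.isoformat())
```

The datetimes are kept in UTC; convert them to your local zone only when displaying them.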