Handling RuntimeException errors in a BigQuery pipeline - google-bigquery

When we use a BigQueryIO transform to insert rows, we have an option called:
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
which instructs the pipeline to NOT attempt to create the table if the table doesn't already exist. In my scenario, I want to trap all errors. I attempted to use the following:
var write = myPipeline.apply("Write table", BigQueryIO
    .<Employee>write()
    .to(targetTableName_notpresent)
    .withExtendedErrorInfo()
    .withFormatFunction(new EmployeeToTableRow())
    .withSchema(schema)
    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
    .withTableDescription("My Test Table")
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND));
which tries to insert rows into a non-existent table. What I found was that a RuntimeException was thrown. Where I am stuck is that I don't know how to handle this RuntimeException: I don't believe there is anything here I can surround with a try/catch.
This question is similar to this one:
Is it possible to catch a missing dataset java.lang.RuntimeException in a Google Cloud Dataflow pipeline that writes from Pub/Sub to BigQuery?
but I don't think that one got a working answer, and it was focused on a missing dataset rather than a missing table.
My exception from the fragment above is:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/XXXX/datasets/jupyter/tables/not_here/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table XXXX:jupyter.not_here",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table XXXX:jupyter.not_here",
  "status" : "NOT_FOUND"
}
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at .(#126:1)

You can't put a try/catch directly around the BigQueryIO transform in the Beam job to handle the case where the destination table doesn't exist.
I think it's better to delegate this responsibility outside of Beam, or to launch the job only if the table exists.
Usually a tool like Terraform is responsible for creating the infrastructure before resources are deployed and Beam jobs are run.
If you do need to check for the table's existence, you can create:
A shell script using the bq and gcloud CLIs that checks for the table before launching the job
A Python script that checks for the table before launching the job
Python script:
For Python, there is the BigQuery Python client:
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to determine existence.
# table_id = "your-project.your_dataset.your_table"

try:
    client.get_table(table_id)  # Make an API request.
    print("Table {} already exists.".format(table_id))
except NotFound:
    print("Table {} is not found.".format(table_id))
bq shell script:
bq show <project_id>:<dataset_id>.<table_id>
If the table doesn't exist, catch the error and do not start the Dataflow job.
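If the pipeline itself is in Java, as in the question above, the same kind of check can also be done with the BigQuery Java client before calling run() on the pipeline. A minimal sketch, assuming placeholder project, dataset, and table names:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class CheckTableExists {
  public static void main(String[] args) {
    // Placeholders: replace with your own project, dataset, and table names.
    TableId tableId = TableId.of("my-project", "jupyter", "not_here");

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // getTable() returns null when the table does not exist.
    Table table = bigquery.getTable(tableId);
    if (table == null) {
      System.err.println("Table " + tableId + " not found; do not launch the Dataflow job.");
      System.exit(1);
    }
    System.out.println("Table " + tableId + " exists; safe to launch the pipeline.");
  }
}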

Related

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple dataflow job to read data from a table and write back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource with a query, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results.
So it is expected behavior for it to create this temp_dataset, which means the error you see is probably not hiding another one.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
Alternatively, you can specify an existing dataset to use for the query results; that way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);

Dataflow insert into BigQuery fails with large number of files for asia-northeast1 location

I am using Cloud Storage Text to BigQuery template on Cloud Composer.
The template is kicked from Python google api client.
The same program:
works fine in the US location (for Dataflow and BigQuery),
fails in the asia-northeast1 location,
works fine with fewer (less than 10,000) input files in the asia-northeast1 location.
Does anybody have an idea about this?
I want to execute in the asia-northeast location for business reason.
More details about the failure:
The program worked until "ReifyRenameInput", and then failed.
The Dataflow job failed
with the error message below:
java.io.IOException: Unable to insert job: beam_load_textiotobigquerydataflow0releaser0806214711ca282fc3_8fca2422ccd74649b984a625f246295c_2a18c21953c26c4d4da2f8f0850da0d2_00000-0, aborting after 9 .
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:231)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:202)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startCopyJob(BigQueryServicesImpl.java:196)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.copy(WriteRename.java:144)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.writeRename(WriteRename.java:107)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.processElement(WriteRename.java:80)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset pj:datasetname", "reason" : "notFound" } ], "message" : "Not found: Dataset pj:datasetname" }
(pj and datasetname are not the real names; they are the project name and dataset name for the outputTable parameter)
Although the error message says the dataset is not found, the dataset definitely exists.
Moreover, some new tables, which seem to be temporary tables, were created in the dataset after the program ran.
According to this public issue tracker, this is a known issue related to your Beam SDK version. The Beam 2.5.0 SDK does not have this issue.

"Invalid schema update" error when loading data using autodetect

Let's say I have a table with one single field named "version", which is a string. When I try to load data into the table using autodetect with values like "1.1" or "1", the autodetect feature infers these values as float or integer type respectively.
data1.json example:
{ "version": "1.11.0" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data1.json
Upload complete.
Waiting on bqjob_ZZZ ... (1s) Current status: DONE
data2.json example:
{ "version": "1.11" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to FLOAT
data3.json example:
{ "version": "1" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data3.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to INTEGER
The scenario where this problem doesn't happen is when, in the same file, there is another JSON record whose value is correctly inferred as a string (as seen in the Bigquery autoconverting fields in data question):
{ "version": "1.12" }
{ "version": "1.12.0" }
In the question listed above, there's an answer stating that a fix was pushed to production, but it looks like the bug is back again. Is there a way/workaround to prevent this?
It looks like the confusing part here is whether "1.12" should be detected as a string or a float. BigQuery chose to detect it as a float. Before autodetect was introduced in BigQuery, BigQuery allowed users to load float values in string format; this is very common in CSV/JSON data. So when autodetect was introduced, it kept this behavior. Autodetect scans up to 100 rows to detect the type. If, for all 100 rows, the data looks like "1.12", then very likely this field is a float value. If one of the rows has the value "1.12.0", then BigQuery will detect that the type is string, as you have observed.
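The answer above explains the behavior but does not spell out a workaround. One option, a suggestion of mine rather than part of the original answer, is to stop relying on autodetect for this load: either drop --autodetect so the existing table schema (with version as STRING) is used, or pass the schema explicitly, for example:
bq load --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json version:STRING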

Google Dataflow stalled after BigQuery outage

I have a Google Dataflow job running. The Dataflow job reads messages from Pub/Sub, enriches them, and writes the enriched data into BigQuery.
Dataflow was processing approximately 5000 messages per second. I am using 20 workers to run the Dataflow job.
Yesterday it seems there was a BigQuery outage, so the step that writes the data into BigQuery failed. After some time, my Dataflow job stopped working.
I see 1000 errors like the one below:
(7dd47a65ad656a43): Exception: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
    "reason" : "invalid"
  } ],
  "message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
  "status" : "INVALID_ARGUMENT"
}
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:285)
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:175)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2728)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2685)
com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:159)
com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:194)
com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:719)
Stack trace truncated. Please see Cloud Logging for the entire trace.
Please note that Dataflow did not start working again even after BigQuery recovered. I had to restart the Dataflow job to make it work.
This causes data loss, not only at the time of the outage, but also until I notice the error and restart the Dataflow job. Is there a way to configure the retry option so that the Dataflow job does not stall in these cases?
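For reference, the newer Apache Beam SDK (which replaced the com.google.cloud.dataflow.sdk classes shown in this stack trace) lets you attach an insert retry policy to BigQueryIO, as the question at the top of this page does. A minimal sketch with placeholder names, not a confirmed fix for the stall described above:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.values.PCollection;

public class WriteWithRetryPolicy {
  // Hypothetical helper: write enriched rows, retrying inserts that fail transiently.
  static void writeRows(PCollection<TableRow> enrichedRows, String tableSpec) {
    enrichedRows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to(tableSpec) // e.g. "my-project:my_dataset.my_table" (placeholder)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER) // table assumed to exist
        // Retry inserts that fail with transient errors (for example, temporary service
        // issues) instead of failing them permanently.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));
  }
}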

com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found

When using a Talend BigQuery input component (BigQuery Java API) to read from BigQuery, I get the following error (for a long-running job):
Exception in component tBigQueryInput_4
com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table rand-cap:_f000fcf374688fc5e7da50a4c0c04ba228d993c3.anon0849eba05949a62962f218a0433d6ee82bf13a7b",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table rand-cap:_f000fcf374688fc5e7da50a4c0c04ba228d993c3.anon0849eba05949a62962f218a0433d6ee82bf13a7b"
}
Is this because the "temporary" table that BigQuery creates when querying results is no longer available after 24 hours? Or is it because a rate limit was exceeded, since I am querying a large table?
In either case, how can I find more details on this error, and what steps should I take to prevent it?
Thank you!
This seems to be a problem in Talend; there are other users describing your issue: https://www.talendforge.org/forum/viewtopic.php?id=44734
Google BigQuery has a property, AllowLargeResults, but it is not available in tBigQueryInput.
Hi there - I am currently using Talend Open Studio v6.1.1 and this issue still exists.