Google Dataflow stalled after BigQuery outage - google-bigquery

I have a Google Dataflow Job running. The dataflow job is reading messages from Pub/Sub, enrich it and write the enriched data into BigQuery.
Dataflow was processing approximately 5000 messages per second. I am using 20 workers to run the dataflow job.
Yesterday it seems there was a BigQuery outage. So writing the data in BigQuery part failed. After some time, my dataflow stopped working.
I see 1000 errors like below
(7dd47a65ad656a43): Exception: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
"reason" : "invalid"
} ],
"message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
"status" : "INVALID_ARGUMENT"
}
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:285)
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:175)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2728)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2685)
com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:159)
com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:194)
com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:719)
Stack trace truncated. Please see Cloud Logging for the entire trace.
Please note that dataflow is not working even the BigQuery started working. I had to restart the dataflow job to make it work.
This causes data loss. Not only at the time of outage, but also until I notice the error and restart dataflow job. Is there a way to configure the retry option so that dataflow job does not go into stale on these cases?

Related

Handling RuntimeException errors in a BigQuery pipeline

When we use a BigQueryIO transform to insert rows, we have an option called:
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
which instructs the pipeline to NOT attempt to create the table if the table doesn't already exist. In my scenario, I want to trap all errors. I attempted to use the following:
var write=mypipline.apply("Write table", BigQueryIO
.<Employee>write()
.to(targetTableName_notpresent)
.withExtendedErrorInfo()
.withFormatFunction(new EmployeeToTableRow())
.withSchema(schema)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withTableDescription("My Test Table")
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
which tried to insert rows into a non-existent table. What I found was a RuntimeException. Where I am stuck is that I don't know how to handle RuntimeException problems. I don't believe there is anything here I can surround with a try/catch.
This question is similar to this one:
Is it possible to catch a missing dataset java.lang.RuntimeException in a Google Cloud Dataflow pipeline that writes from Pub/Sub to BigQuery?
but I don't think that got a working answer and was focused on a missing Dataset and not a table.
My exception from the fragment above is:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/XXXX/datasets/jupyter/tables/not_here/insertAll?prettyPrint=false
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table XXXX:jupyter.not_here",
"reason" : "notFound"
} ],
"message" : "Not found: Table XXXX:jupyter.not_here",
"status" : "NOT_FOUND"
}
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at .(#126:1)
You can't add a try/catch directly from the BigQueryIO in the Beam job, if the destination table doesn't exist.
I think it's better to delegate this responsability outside of Beam or launch the job only if your table exists.
Usually a tool like Terraform has the responsability to create infrastructure, before to deploy resources and run Beam jobs.
If it's mandatory for you to check the existence of the table, you can create :
A Shell script with bq and gcloud cli to check the existence before to launch the job
A Python script to check the existence before to launch the job
Python script :
For Python there is the BigQuery Python client :
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to determine existence.
# table_id = "your-project.your_dataset.your_table"
try:
client.get_table(table_id) # Make an API request.
print("Table {} already exists.".format(table_id))
except NotFound:
print("Table {} is not found.".format(table_id))
BQ Shell script :
bq show <project_id>:<dataset_id>.<table_id>
If the table doesn't exist, catch the error and do not start the Dataflow job.

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs to BigQuery and only a few cases causing the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written on Python and uses libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
Load job is done by calling
load_job = self._client.load_table_from_uri(
source_uris=source_uri,
destination=destination,
job_config=job_config,
)
this method has a default param:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically retry on such errors.
Id of specific job that finished with error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
after getting the error the application recreates the job, but it doesn't help.
Subsequent failed job ids:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, Wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur, if you have a support plan please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing your issue. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages you can refer to this document.

Dataflow insert into BigQuery fails with large number of files for asia-northeast1 location

I am using Cloud Storage Text to BigQuery template on Cloud Composer.
The template is kicked from Python google api client.
The same program
works fine in US location (for Dataflow and BigQuery).
fails in asia-northeast1 location.
works fine with the fewer (less than 10000) input files in asia-northeast location.
Does anybody have an idea about this?
I want to execute in the asia-northeast location for business reason.
More details about failure:
The program worked until "ReifyRenameInput", and the failed .
dataflow job failed
with the error message below:
java.io.IOException: Unable to insert job: beam_load_textiotobigquerydataflow0releaser0806214711ca282fc3_8fca2422ccd74649b984a625f246295c_2a18c21953c26c4d4da2f8f0850da0d2_00000-0, aborting after 9 .
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:231)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:202)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startCopyJob(BigQueryServicesImpl.java:196)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.copy(WriteRename.java:144)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.writeRename(WriteRename.java:107)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.processElement(WriteRename.java:80)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset pj:datasetname", "reason" : "notFound" } ], "message" : "Not found: Dataset pj:datasetname" }
(pj and dataset name are not real name, and they are project name and dataset name for outputTable parameter)
Although the error message said the dataset is not found, the dataset surely existed.
Moreover, some new tables which seems to be tempory tables were created in the dataset after the program.
This is a known issue related to your Beam SDK version according to this public issue tracker. The Beam 2.5.0 SDK version doesn't have this issue.

com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found

When using a Talend bigquery input component (BQ java api) to read from bigquery, I get the following error (for a long running job) -
Exception in component tBigQueryInput_4
com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table rand-cap:_f000fcf374688fc5e7da50a4c0c04ba228d993c3.anon0849eba05949a62962f218a0433d6ee82bf13a7b",
"reason" : "notFound"
} ],
"message" : "Not found: Table rand-cap:_f000fcf374688fc5e7da50a4c0c04ba228d993c3.anon0849eba05949a62962f218a0433d6ee82bf13a7b"
}
Is this because of the "temporary" table that bq creates when querying results not being available after 24hrs. Or is it because rate limit was exceeded since I am querying a large table ?
In either case, how can I find more details on this error and what steps should I take to prevent this ?
Thank you !
This seems to be a problem in Talend, there are other users describing your issue: https://www.talendforge.org/forum/viewtopic.php?id=44734
Google Bigquery has a property i.e. Allowlargeresults but its not there in TBigqueryinput.
Hi there - I am currently using Talend open studio v6.1.1 and this issue still exists.

Frequent 503 errors raised from BigQuery Streaming API

Streaming data into BigQuery keeps failing due to the following error, which occurs more frequently recently:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "Connection error. Please try again.",
"reason" : "backendError"
} ],
"message" : "Connection error. Please try again."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:312)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1049)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
Relevant question references:
Getting high rate of 503 errors with BigQuery Streaming API
BigQuery - BackEnd error when loading from JAVA API
We (the BigQuery team) are looking into your report of increased connection errors. From the internal monitoring, there hasn't been global a spike in connection errors in the last several days. However, that doesn't mean that your tables, specifically, weren't affected.
Connection errors can be tricky to chase down, because they can be caused by errors before they get to the BigQuery servers or after they leave. The more information your can provide, the easier it is for us to diagnose the issue.
The best practice for streaming input is to handle temporary errors like this to retry the request. It can be a little tricky, since when you get a connection error you don't actually know whether the insert succeeded. If you include a unique insertId with your data (see the documentation here), you can safely resend the request (within the deduplication window period, which I think is 15 minutes) without worrying that the same row will get added multiple times.