BigQuery job intermittent failure with jobInternalError - google-bigquery

We are experiencing a strange issue with Cloud Composer BigQueryOperator tasks (in multiple DAGs) and with some scheduled queries. The failures are random and intermittent, and on retry the same jobs/queries execute successfully.
Looking into INFORMATION_SCHEMA.JOBS_BY_PROJECT, the failed jobs show:
error_result.reason = jobInternalError
error_result.message = The job encountered an internal error during execution and was unable to complete successfully.
Yet bytes were billed, slot time was consumed, and the child jobs appear to have completed successfully.
This only started happening yesterday, and we have not made any changes to any of these jobs.
Has anyone else experienced this recently? How do we debug this further, or get additional assistance on it?
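For reference, this is roughly how we inspect the failed parent jobs and their child jobs with the google-cloud-bigquery client (a minimal sketch; the project ID, region qualifier, and date filter are placeholders, not values from our environment):
from google.cloud import bigquery

client = bigquery.Client(project="our-project-id")  # placeholder project

# Find recent parent jobs that failed with jobInternalError.
sql = """
    SELECT job_id, creation_time,
           error_result.reason AS reason, error_result.message AS message
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE error_result.reason = 'jobInternalError'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY creation_time DESC
"""
for row in client.query(sql).result():
    print(row.job_id, row.reason, row.message)
    # The failed job is a SCRIPT; list its child jobs to see whether any of them
    # actually carries an error (in our case they all look successful).
    for child in client.list_jobs(parent_job=row.job_id):
        print("  child:", child.job_id, child.state, child.error_result)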
An example error message received from a Cloud Composer task failure:
Exception:
BigQuery job failed. Final error was: {'reason': 'jobInternalError', 'message': 'The job encountered an internal error during execution and was unable to complete successfully.'}. The job was: {'kind': 'bigquery#job', 'etag': 'cFex61A/InyX/L1+vy8GHw==', 'id': '#################:US.job_FIr-PFixKsdVvOJddG9zuIUeSj1i', 'selfLink': 'https://bigquery.googleapis.com/bigquery/v2/projects/#################/jobs/job_FIr-PFixKsdVvOJddG9zuIUeSj1i?location=US', 'user_email': '############developer.gserviceaccount.com', 'configuration': {'query': {'query': '############################################;\n ', 'priority': 'INTERACTIVE', 'useLegacySql': False}, 'jobType': 'QUERY'}, 'jobReference': {'projectId': '#################', 'jobId': 'job_FIr-PFixKsdVvOJddG9zuIUeSj1i', 'location': 'US'}, 'statistics': {'creationTime': '1629364585265', 'startTime': '1629364585352', 'endTime': '1629365366433', 'totalBytesProcessed': '135043690', 'query': {'totalBytesProcessed': '135043690', 'totalBytesBilled': '492830720', 'totalSlotMs': '192771', 'schema': {'fields': [{'name': 'total_rows', 'type': 'NUMERIC', 'mode': 'NULLABLE'}]}, 'statementType': 'SCRIPT'}, 'totalSlotMs': '192771', 'numChildJobs': '69'}, 'status': {'errorResult': {'reason': 'jobInternalError', 'message': 'The job encountered an internal error during execution and was unable to complete successfully.'}, 'state': 'DONE'}}
Host: airflow-worker-##############
Log file: /home/airflow/gcs/logs/#########################/bq_build_reconciliation/2021-08-18T09:00:00+00:00.log
An example failure message from a scheduled query:
2021-08-19T03:06:13.536907177Z
Job scheduled_query_6148e950-0000-2b6a-89c9-94eb2c09dfdc (table ) failed with error INTERNAL: The job encountered an internal error during execution and was unable to complete successfully.; JobID: ##########:scheduled_query_6148e950-0000-2b6a-89c9-94eb2c09dfdc
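Since a manual retry consistently succeeds, one stopgap for the Composer tasks (our own assumption, not a documented fix for jobInternalError) would be to let Airflow retry them automatically with a back-off. A minimal sketch, assuming the Airflow 1.10-style imports Composer used at the time and a placeholder SQL script:
from datetime import timedelta
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

bq_build_reconciliation = BigQueryOperator(
    task_id="bq_build_reconciliation",
    sql="SELECT 1  -- placeholder for the actual multi-statement script",
    use_legacy_sql=False,
    retries=3,                          # re-run the whole BigQuery job on failure
    retry_delay=timedelta(minutes=5),   # wait between attempts
    retry_exponential_backoff=True,     # increase the wait on repeated failures
    dag=dag,                            # assumes an existing DAG object
)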
Any help is appreciated, thanks!

Related

BigQuery API call by BigQuerySinkConnector send error

I have a problem when executing a query on BigQuery and I end up with the following error:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:568)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:326)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: A write thread has failed with an unrecoverable error
Caused by: The job encountered an internal error during execution and was unable to complete successfully.
at com.wepay.kafka.connect.bigquery.write.batch.KCBQThreadPoolExecutor.lambda$maybeThrowEncounteredError$0(KCBQThreadPoolExecutor.java:101)
at java.base/java.util.Optional.ifPresent(Optional.java:183)
at com.wepay.kafka.connect.bigquery.write.batch.KCBQThreadPoolExecutor.maybeThrowEncounteredError(KCBQThreadPoolExecutor.java:100)
at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.put(BigQuerySinkTask.java:236)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
... 10 more
Caused by: com.google.cloud.bigquery.BigQueryException: The job encountered an internal error during execution and was unable to complete successfully.
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:113)
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:623)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1222)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1217)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1216)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1200)
at com.google.cloud.bigquery.Job$1.call(Job.java:332)
at com.google.cloud.bigquery.Job$1.call(Job.java:329)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.poll(RetryHelper.java:64)
at com.google.cloud.bigquery.Job.waitForQueryResults(Job.java:328)
at com.google.cloud.bigquery.Job.getQueryResults(Job.java:291)
at com.google.cloud.bigquery.BigQueryImpl.query(BigQueryImpl.java:1187)
at com.wepay.kafka.connect.bigquery.MergeQueries.mergeFlush(MergeQueries.java:158)
at com.wepay.kafka.connect.bigquery.MergeQueries.lambda$mergeFlush$1(MergeQueries.java:119)
... 3 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
GET https://www.googleapis.com/bigquery/v2/projects/car-project-prd/queries/3df89651-3567-4b28-88a8-0655a174574c?location=EU&maxResults=0&prettyPrint=false
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "The job encountered an internal error during execution and was unable to complete successfully.",
"reason" : "jobInternalError"
} ],
"message" : "The job encountered an internal error during execution and was unable to complete successfully.",
"status" : "INVALID_ARGUMENT"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:149)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:112)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:39)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1108)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591)
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:621)
... 20 more
The context is as follows:
I use the Kafka BigQuerySinkConnector to update partitioned tables in a dataset. The connector works perfectly for 1, 2 or 3 days and then fails with the trace above.
Can you tell me if you have an idea of the reason for this error, or any leads I could follow to track down and solve the problem?
I tried to get information from the Google help pages, but it is just the following message:
Support Google for Error

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of load jobs daily to load data from Google Cloud Storage URIs into BigQuery, and only a few of them fail with the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written in Python and uses these libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
The load job is created by calling:
load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
)
This method has a default parameter:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically retry on such errors.
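For what it's worth, a hedged sketch of passing an explicit retry policy (as far as I understand, this retry governs the API requests that create and poll the job, not the job execution itself; the deadline value is illustrative):
from google.cloud.bigquery.retry import DEFAULT_RETRY

# Illustrative only: extend the default retry policy's deadline to 10 minutes.
custom_retry = DEFAULT_RETRY.with_deadline(600.0)

load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
    retry=custom_retry,
)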
ID of a specific job that finished with the error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
After getting the error, the application recreates the job, but that does not help.
Subsequent failed job IDs:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur and you have a support plan, please create a new GCP support case. Otherwise, you can open a new issue on the public issue tracker describing your problem. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages, you can refer to this document.
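To make the back-off concrete, here is a minimal sketch of resubmitting a failed load job with exponential back-off (the set of retryable reasons, the delays, and the attempt count are assumptions, not values from the SLA):
import time
from google.cloud import bigquery

RETRYABLE_REASONS = {"internalError", "backendError", "rateLimitExceeded"}  # assumed set

def load_with_backoff(client, source_uri, destination, job_config,
                      max_attempts=5, base_delay=2.0):
    """Resubmit the load job with exponential back-off on transient job errors."""
    for attempt in range(max_attempts):
        job = client.load_table_from_uri(source_uri, destination, job_config=job_config)
        try:
            job.result()  # blocks until the job finishes; raises on a job-level error
            return job
        except Exception:
            reason = (job.error_result or {}).get("reason")
            if reason not in RETRYABLE_REASONS or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

# Usage (bucket, table, and format are placeholders):
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON)
load_with_backoff(client, "gs://my-bucket/data.json", "my_dataset.my_table", job_config)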

Can anyone help me understand the following errors where RabbitMQ goes into an error state?

I am getting this error again and again.
"operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
For the above error I have already found something in https://bugzilla.redhat.com/show_bug.cgi?id=1343027, i.e.:
Rabbit can join the RabbitMQ cluster if controller-0 was rebooted, came up, and started all the resources, and only when everything works does controller-1 go for the reboot. In other words, everything should work when rebooting one of the controllers. If, for some reason, controller-1 reboots while controller-0 has not fully recovered after its own reboot, things go wrong.
But I am not sure why the error log file is also showing me the error below:
=ERROR REPORT==== 29-Dec-2019::17:44:26 ===
Mnesia('messaging#rabbit-2'): ** ERROR ** (ignoring core) ** FATAL ** mnesia_monitor crashed:
{badarg, [{ets, lookup, [mnesia_decision, 'messaging#rabbit-3'], []},
          {mnesia_recover, has_mnesia_down, 1, [{file, "mnesia_recover.erl"}, {line, 299}]},
          {mnesia_monitor, check_mnesia_down, 2, [{file, "mnesia_monitor.erl"}, {line, 862}]},
          {mnesia_monitor, handle_info, 2, [{file, "mnesia_monitor.erl"}, {line, 579}]},
          {gen_server, try_dispatch, 4, [{file, "gen_server.erl"}, {line, 615}]},
          {gen_server, handle_msg, 5, [{file, "gen_server.erl"}, {line, 681}]},
          {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 240}]}]}
state: {state, <0.745.0>, [], [], true, [], undefined, [], []}
The error message says that one system process of the Mnesia DB, mnesia_monitor, is crashing when it tries to look up a value from an ETS table (mnesia_decision) owned by another system process of the DB, mnesia_recover. This can only happen if the ETS table no longer exists, that is, if mnesia_recover has stopped.
This error message doesn't say why mnesia_recover has stopped. If it has crashed, there should be another error message about that event in the log. But it is also possible that the whole Mnesia application was stopping at that time, because the supervisor would stop mnesia_recover before mnesia_monitor. If that's the case, this error is just caused by bad timing: mnesia_monitor sees the messaging#rabbit-3 node coming up at a point when Mnesia on its own node is already shutting down.

Celery Error Handling

I've built a fairly simple application connecting Flask, Celery, and RabbitMQ with docker-compose, by combining a few solutions I saw online. I'm having some issues trying to update task states to reflect whether a failure occurred. To keep error visibility at its highest, I've had my custom class raise only expected errors; everything else is handled at the Celery app level as follows (in celery_app.py):
import traceback
from celery import current_task

@celery_app.task(name='celery_worker.summary')
def async_summary(data):
    """Background summary processing"""
    try:
        logger.info('Summarizing text')
        return BdsSummary(data, nlp=en_nlp).create_summary()
    except Exception as e:
        current_task.update_state(state='FAILURE', meta={'error_message': str(traceback.format_exc())})
        logger.exception('Text Summary worker raised: %r' % e)
I've been doing some negative testing against my application. When I pass it data that I know will throw an error (non-text data, for example) and then run r = requests.get('http://my.app.addr:8888/task/my-task-id'), I get {'status': 'SUCCESS', 'result': None}. I'm vexed as to why this is happening. Based on my admittedly limited understanding of Celery's behavior, it should update the status to show a traceback and exception class; why would it not do this?
I am relatively new to Celery, so my understanding of the Canvas that they reference in the documentation is extremely basic. I'm just trying to provide some basic task failure information to the response/task. For context, when I give it proper input, I get back {'status': 'SUCCESS', 'result': {'summary': 'My Summary text here', 'num_sentences': 3, ...}}.
Any insight here would be much appreciated
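One possible explanation, going only by the handler shown above rather than anything authoritative: the except block swallows the exception, so the task returns None and finishes normally, and Celery records SUCCESS, overwriting whatever update_state set. A minimal sketch of the same task with the exception re-raised so Celery itself records the FAILURE state and traceback (raising celery.exceptions.Ignore after update_state is another common variant):
@celery_app.task(name='celery_worker.summary')
def async_summary(data):
    """Background summary processing"""
    try:
        logger.info('Summarizing text')
        return BdsSummary(data, nlp=en_nlp).create_summary()
    except Exception as e:
        logger.exception('Text Summary worker raised: %r' % e)
        # Re-raising lets Celery record the FAILURE state and traceback itself;
        # swallowing the exception makes the task return None and end as SUCCESS.
        raise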

SQL server 2005 agent not working

SQL Server 2005 Service Pack 2, version 9.00.3042.00
All maintenance plans fail with the same error.
The details of the error are:
Execute Maintenance Plan
Execute maintenance plan. test7 (Error)
Messages
Execution failed. See the maintenance plan and SQL Server Agent job history logs for details.
The advanced information section shows the following:
Job 'test7.Subplan_1' failed. (SqlManagerUI)
Program Location:
at Microsoft.SqlServer.Management.SqlManagerUI.MaintenancePlanMenu_Run.PerformActions()
At this point the following appear in the windows event log:
Event Type: Error
Event Source: SQLISPackage
Event Category: None
Event ID: 12291
Date: 28/05/2009
Time: 16:09:08
User: 'DOMAINNAME\username'
Computer: SQLSERVER4
Description:
Package "test7" failed.
and also this:
Event Type: Warning
Event Source: SQLSERVERAGENT
Event Category: Job Engine
Event ID: 208
Date: 28/05/2009
Time: 16:09:10
User: N/A
Computer: SQLSERVER4
Description:
SQL Server Scheduled Job 'test7.Subplan_1' (0x96AE7493BFF39F4FBBAE034AB6DA1C1F) - Status: Failed - Invoked on: 2009-05-28 16:09:02 - Message: The job failed. The Job was invoked by User 'DOMAINNAME\username'. The last step to run was step 1 (Subplan_1).
There are no entries in the SQL Agent log at all.
Probably no points for this, but you're likely to get more help on this over at ServerFault.com now that they are open.