Load from GCS to GBQ causes an internal BigQuery error - google-bigquery

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs to BigQuery and only a few cases causing the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written on Python and uses libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
Load job is done by calling
load_job = self._client.load_table_from_uri(
source_uris=source_uri,
destination=destination,
job_config=job_config,
)
this method has a default param:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically retry on such errors.
Id of specific job that finished with error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
after getting the error the application recreates the job, but it doesn't help.
Subsequent failed job ids:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...

As mentioned in the error message, Wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur, if you have a support plan please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing your issue. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages you can refer to this document.

Related

Datastream Troubleshoot: "An unknown error occurred. Please try again. If the error persists, contact Google support"

We are trying to replicate data from AlloyDB to Bigquery using Datastream.
We Get "An unknown error occurred. Please try again. If the error persists, contact Google support."
In the Datastream console --> objects list, we see all source tables with Object Status "Failed" and Backfill status "Completed".
In Bigquery we see only a subset of the tables (not all the "Completed" objects were synced).
In the Logs Explorer I can see this error on BQ:
I also see this error: error: {
code: 11
message: "Unsupported primary key column either does not exist or is a pseudocolumn at [1:401]"
}
The column referred in the error is of type enum.
The desired situation is having all the AlloyDB tables replicated into Bigquery.
The error message is not very informative...
What does it mean?
What would be the best way to go about troubleshooting this?
We're actively working on making these error messages be more informative, and improvements are continuously being rolled out as we identify more edge cases. Assuming you followed all the steps in the documentation, then you may need to open a ticket with support for further investigation. If a support ticket isn't an option, you can still report the issue using the public issue tracker
I just had this same issue but connecting to a PostgreSQL in AWS RDS:
Beginning with Postgres 10, passwords are encrypted using SCRAM-SHA-256 in PostgreSQL. Google DataStream still expects MD5 password encryption, or it will generate an "unknown error" in the logs and fail the backfills.
You'll need to update your postgresql.conf (or RDS Cluster Parameter Group if you're using AWS like me):
password_encryption = 'MD5'
Restart the database and make sure the parameter has changed with:
SHOW password_encryption;
Reset the password of your users:
ALTER USER "{username}" with password '{password}';
More info from the PostgreSQL docs: https://www.postgresql.org/docs/current/auth-password.html

Issues while running with BigQueryIO.Write.Method.STORAGE_WRITE_API

We are testing with STORAGE_WRITE_API to insert data into BigQuery. We've seen several errors/warnings in our Dataflow pipeline(written in Java). It might work well in the beginning, but eventually the system lag would be increasing, it would stop processing any data from PubSub and the unacked messages piled up.
One common warning is:
Operation ongoing in step insertTableRowsToBigQuery/StorageApiLoads/StorageApiWriteSharded/Write Records for at least 03h35m00s without outputting or completing in state process
at java.base#11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base#11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.base#11.0.9/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Callback.await(RetryManager.java:153)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.await(RetryManager.java:136)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.await(RetryManager.java:256)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:453)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Other exceptions we've seen:
Got error io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed
Got error io.grpc.StatusRuntimeException: ALREADY_EXIST
PodSandboxStatus of sandbox "..." for pod "df-...-pipeline-...-harness-qw4j_default(...)" error: rpc error: code = Unknown desc = Error: No such container
Code sample:
toBq.apply("insertTableRowsToBigQuery",
BigQueryIO
.writeTableRows()
.to(String.format("%s:%s.%s", PROJECT_ID, DATASET, table))
.withTriggeringFrequency(Duration.standardSeconds(options.getTriggeringFrequency()))
.withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
There was a production issue related to connection being stuck after streaming 10MB which has been fixed. If you try again, it should work.

How to receive root cause for Pipeline Dataflow job failure

I am running my pipeline in Dataflow. I want to collect all error messages from Dataflow job using its id. I am using Apache-beam 2.3.0 and Java 8.
DataflowPipelineJob dataflowPipelineJob = ((DataflowPipelineJob) entry.getValue());
String jobId = dataflowPipelineJob.getJobId();
DataflowClient client = DataflowClient.create(options);
Job job = client.getJob(jobId);
Is there any way to receive only error message from pipeline?
Programmatic support for reading Dataflow log messages is not very mature, but there are a couple options:
Since you already have the DataflowPipelineJob instance, you could use the waitUntilFinish() overload which accepts a JobMessagesHandler parameter to filter and capture error messages. You can see how DataflowPipelineJob uses this in its own waitUntilFinish() implementation.
Alternatively, you can query job logs using the Dataflow REST API: projects.jobs.messages/list. The API takes in a minimumImportance parameter which would allow you to query just for errors.
Note that in both cases, there may be error messages which are not fatal and don't directly cause job failure.

Datastax: Block not found error from DSEFS

Spark streaming job running in DSE using DSEFS for check-pointing directory. I see this error in debug log file. How to resolve this error?
ERROR [dsefs-netty-worker-5] 2017-12-01 05:23:02,679 DSE-FS RestServerHandler.scala:126 - [id: 0x9964e082, /<>:58874 :> 0.0.0.0/0.0.0.0:5598] Streaming data to remote end failed.
java.io.IOException: Block not found a3859f30-aa23-11e7-80b9-4b8bdaf197cd
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$33$1.apply(BlockService.scala:706) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$33$1.apply(BlockService.scala:703) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [scala-library-2.10.6.jar:na]
at com.datastax.bdp.fs.exec.SameThreadExecutionContext$class.executeInSameThread(SameThreadExecutionContext.scala:24) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.exec.SameThreadExecutionContext$class.execute(SameThreadExecutionContext.scala:33) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.exec.SerialExecutionContextProvider$$anon$5$$anon$2.execute(SerialExecutionContextProvider.scala:24) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) [scala-library-2.10.6.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) ~[scala-library-2.10.6.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) ~[scala-library-2.10.6.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) ~[scala-library-2.10.6.jar:na]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$1$1.apply(BlockService.scala:60) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$1$1.apply(BlockService.scala:60) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [scala-library-2.10.6.jar:na]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_112]
This error means DSEFS server failed to find metadata of the data block in the dsefs.blocks Cassandra table. The ids of the file blocks are stored in the dsefs.block_offsets table and they reference blocks stored in dsefs.blocks. If a row exists in dsefs.block_offsets and points to the block id that is absent in dsefs.blocks, you get this error when reading the file.
This error should not happen under normal circumstances and it means the filesystem metadata somehow got into inconsistent state. This may be a bug in the DSEFS implementation, a result of a data loss caused by setting up dsefs keyspace with insufficient replication factor or a result of a write operation that did not finish successfully and was applied only partially.
Please make sure you set dsefs keyspace RF to at least 3 and run nodetool repair to avoid accidental data loss or unavailability of some DSEFS metadata.
If this doesn't help, please contact me directly or through DataStax technical support and provide more details, including logs from the time before the error and more context on what the job was doing when the failure occurred.

BigQuery loads manually but not through the Java SDK

I have a Dataflow pipeline, running locally. The objective is to read a JSON file using TEXTIO, make sessions and load it into BigQuery. Given the structure I have to create a temp directory in GCS and then load it into BigQuery using that. Previously I had a data schema error that prevented me to load the data, see here. That issue is resolved.
So now when I run the pipeline locally it ends with dumping a temporary JSON newline delimited file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive and the data is still not loaded in BigQuery. What is puzzling is that if I go to the BigQuery UI and load the same temporary file from GCS that was dumped by the SDK's Dataflow pipeline manually, in the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options)
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.named("loadJob")
.to("myproject:db.table")
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
The SDK is swallowing the error/exception and not reporting it to users. It's most likely a schema problem. To get the actual error that is happening you need to fetch the job details by either:
CLI - bq show -j job beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
#jkff has raised an issue here to improve the error reporting.