BigQuery "Max retries exceeded" when running dbt - google-bigquery

When running dbt, we randomly have some models fail with the following error:
HTTPSConnectionPool(host='bigquery.googleapis.com', port=443):
Max retries exceeded with url: /bigquery/v2/projects/xxxx/jobs
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7fdce6dbb0>:
Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
I searched online but could not find anything relating this error to dbt.
Is this an issue internal to dbt, or is the cause something external? Is there a way to prevent it?
We are running dbt targeting BigQuery using a workflow scheduler (Argo) in a GKE cluster.
Thank you! :)

In the end, the problem was our use of preemptible nodes in GKE. The errors occurred when the dbt run was executing during a restart of the kube-dns/kube-proxy service.
We "solved" the problem by adding retry logic in Argo for failed steps; an equivalent script-level sketch is below.

Related

AWS DMS FATAL_ERROR Error with replicate-ongoing-changes only

I'm trying to migrate data from Aurora MySQL to S3. Since Aurora MySQL does not support replicating ongoing changes from the cluster reader endpoint, my source endpoint is attached to the cluster writer endpoint.
When I choose full-load migration only, DMS works. However, when I choose full-load + ongoing replication, or ongoing replication only, I get the error: Last Error Task 'courral-membership-s3-writer' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL.
Thanks in advance.
This could be caused by the replication instance class; you may need to upgrade it.
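If an upgrade is needed, it can be done from the console or programmatically. A minimal boto3 sketch, with a placeholder ARN and a hypothetical target class:

import boto3

dms = boto3.client("dms")
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE",  # placeholder ARN
    ReplicationInstanceClass="dms.c5.large",  # hypothetical larger class
    ApplyImmediately=True,  # note: applying immediately restarts the instance
)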

Failed to execute 'setItem' on 'Storage': Setting the value of '136114546' exceeded the quota

I am using Vue on the frontend of my application. The application runs well on my local machine without any errors, but on the server there are issues such as click events not setting items. When I check the console, I get this error: Uncaught DOMException: Failed to execute 'setItem' on 'Storage': Setting the value of '136114546' exceeded the quota.
I found this related question where the answer was that storage was full, but I have unlimited storage and it works well on my local machine.
What could be the solution to this kind of error? Since it works well on my local machine, could the problem be with the server?

Apache Flink error checkpointing to S3

We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point the flow crashed and was not able to recover automatically; however, we were able to restart it manually, without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.

Google PubSub error [code=8a75]

Today, I started getting this error sporadically. The Google Pub/Sub error codes documentation talks only about HTTP error codes. Does anyone know about this error?
ERROR Error: The service was unable to fulfill your request. Please try again. [code=8a75]
This error code is retryable and can be safely retried. The recommended solution is to have your code automatically retry with backoff, or to use one of the official client libraries, which retry on these errors with backoff automatically.
In general these errors should be independent, meaning that after a retry or two the odds that your RPC fails should be very low.
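As an illustration of the retry-with-backoff approach (not the client library's own internals), a minimal Python sketch; the rpc callable and the delay parameters are placeholders for whatever call is failing:

import random
import time

def call_with_backoff(rpc, max_attempts=5, base_delay=1.0, max_delay=30.0):
    # Call an RPC, retrying with exponential backoff plus jitter.
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except Exception as exc:  # ideally narrow to the client's retryable error type
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay)  # add jitter to avoid thundering herds
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)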

tensorflow serving: failed to connect to 'ipv4:127.0.0.1:9000'

I have installed and configured tensorflow serving on an "AWS t2.large Ubuntu 14.04" server.
When I attempt to test the server with the mnist_client utility by executing the command, bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=localhost:9000, I receive the following error message:
E0605 05:03:54.617558520 1232 tcp_client_posix.c:191] failed to connect to 'ipv4:127.0.0.1:9000': timeout occurred
Any idea how to fix this?
I haven't heard of anything like this, but I did note that (at least with other test clients) requests would time out when the server was not up or ready yet. So my guess is that the server is not up yet, or is perhaps in some other bad state.
I ran into the same problem before. The root cause was that mnist_client was run on my local machine instead of on the server, because the command connects to localhost: bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=localhost:9000
It succeeded when I ran the mnist_client utility on the server itself.
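To tell "server not up" apart from "client pointed at the wrong machine", a quick reachability check from the machine running the client can help. A minimal Python sketch, assuming the host and port from the question:

import socket

def port_is_open(host, port, timeout=3.0):
    # Return True if a TCP connection to host:port succeeds within timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this where mnist_client runs; False means the serving port is not
# reachable from that machine (server down, wrong host, or firewall).
print(port_is_open("localhost", 9000))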