An error has been thrown from the AWS client - error-handling

I got this error when running a Collibra DQ job via AWS Athena:
An error has been thrown from the AWS Athena client. Query exhausted resources at this scale factor at com.simba.athena.athena.api.AJClient.executeQuery at com.simba.athena.athena.dataengine
Can anyone let me know how to address this error?
I increased the number of executors, the executor memory, and the driver memory, but nothing helped.
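Since the message comes from the Athena client, one way to narrow this down is to run the same query directly against Athena, outside the Collibra DQ job. A minimal diagnostic sketch with boto3; the region, query, database, and output location are placeholders:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

qid = athena.start_query_execution(
    QueryString="SELECT ...",  # placeholder: the query the DQ job issues
    QueryExecutionContext={"Database": "my_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
)["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(status["State"], status.get("StateChangeReason", ""))

If the direct run fails with the same "Query exhausted resources at this scale factor" message, the limit is being hit on the Athena service side, and raising Spark executor or driver memory in the DQ job will not change it.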

Related

AWS DMS FATAL_ERROR Error with replicate-ongoing-changes only

I'm trying to migrate data from Aurora MySQL to S3. Since Aurora MySQL does not support replicating ongoing changes from the cluster reader endpoint, my source endpoint is attached to the cluster writer endpoint.
When I choose full-load migration only, DMS works. However, when I choose full-load + ongoing replication, or ongoing replication only, I get this error: Last Error Task 'courral-membership-s3-writer' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL
Thanks in advance.
This could be caused by an undersized replication instance class; you may need to upgrade it.
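If that turns out to be the cause, the upgrade can be scripted. A minimal sketch with boto3; the ARN and the target instance class are placeholders:

import boto3

dms = boto3.client("dms")
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",  # placeholder ARN
    ReplicationInstanceClass="dms.c5.xlarge",  # placeholder target class
    ApplyImmediately=True,  # otherwise the change waits for the maintenance window
)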

AWS Glue Job error: An error occurred while calling o82.parquet. Not Found

We use AWS Glue jobs for some of our data processing, with pyspark, but from time to time we see this error on some step of a job:
An error occurred while calling o82.parquet. Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: ABC111; S3 Extended Request ID: ABC111abc111)
This error seems to be intermittent: sometimes just rerunning the same job with the same parameters runs fine. But it is not a very descriptive error, and we'd like to avoid it as our number of automated jobs grows.
The latest entries I see in the CloudWatch logs:
WARN [Executor task launch worker for task 1318] client.YarnClient (YarnClient.java:makeRestApiRequest(66)) - The GET request failed for the URL http://0.0.0.0:8088/ws/v1/cluster/apps/application_1583197528647_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:8088 [/0.0.0.0] failed: Connection refused (Connection refused)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:158)
...
Caused by: java.net.ConnectException: Connection refused (Connection refused)
...
ERROR [SIGTERM handler] executor.CoarseGrainedExecutorBackend (SignalUtils.scala:apply$mcZ$sp(43)) - RECEIVED SIGNAL TERM
An overview of the job:
1. Reads JSON files using the Glue Data Catalog and writes aggregated data to S3 in Parquet format. (I see a new partition appear here, but I'm pretty sure the job fails at this step, since I don't see any of the messages I put into the code after it.)
2. Reads the data from the previous step, reads a CSV mapping file from S3, joins the two datasets, does some additional calculations with pyspark, and finally writes the output to S3 in CSV format.
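Since a rerun with the same parameters usually succeeds, a stopgap while the root cause is investigated is to retry the run from the orchestration side. A rough sketch with boto3; the job name, polling interval, and retry budget are placeholders:

import time
import boto3

glue = boto3.client("glue")

def run_with_retries(job_name: str, attempts: int = 3) -> str:
    """Start a Glue job run, retrying on failure; returns the successful run id."""
    for attempt in range(1, attempts + 1):
        run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
        # Poll until the run reaches a terminal state
        while True:
            state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
            if state in ("SUCCEEDED", "FAILED", "ERROR", "STOPPED", "TIMEOUT"):
                break
            time.sleep(30)  # placeholder polling interval
        if state == "SUCCEEDED":
            return run_id
    raise RuntimeError(f"{job_name} did not succeed after {attempts} attempts")

run_with_retries("my-parquet-aggregation-job")  # placeholder job name

Note that Glue jobs also have a built-in MaxRetries setting on the job definition, which is the simpler knob if blanket retries are acceptable.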

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app that uploads some files to an S3 bucket and, at a later point, reads those files from the bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files in the S3 bucket.
Uploading files to the S3 bucket works fine, but when the second phase of my app starts, reading those uploaded files back from S3, it throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to control this error by increasing the max connections parameter of the s3a API.
As of now, I have around 1000 files that my app pushes to and pulls from the S3 bucket, and my max connections setting is 3000. I'm using Flink's parallelism to upload/download these files; my task manager count is 14.
This is an intermittent failure; I have success cases for this scenario as well.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, then my app should throw this error on every run.
Is there any way to calculate the optimal number of max connections my app needs to avoid the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files and each file's size. Each split is read separately, so the theoretical max number of simultaneous connections isn't based on the number of files alone, but on a combination of file count and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using AWS's EMRFS connection code, the setting to bump is fs.s3.maxConnections, which isn't the same as the pure Hadoop s3a setting (fs.s3a.connection.maximum).
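Putting those points together, here is a back-of-envelope sketch for sizing the pool. The 128 MB split size, the 2x safety factor, and the 8 slots per task manager in the example are assumptions, not Flink constants:

import math

def estimate_pool_size(file_sizes_bytes, total_parallelism,
                       split_bytes=128 * 1024 * 1024, safety_factor=2):
    # One connection per split being read; small files yield one split each.
    splits = sum(max(1, math.ceil(size / split_bytes)) for size in file_sizes_bytes)
    # At most `total_parallelism` splits are read at the same time.
    concurrent_reads = min(splits, total_parallelism)
    # Head-room for connections the pool hasn't released yet.
    return concurrent_reads * safety_factor

# e.g. 1000 files of ~50 MB with 14 task managers x 8 slots each:
print(estimate_pool_size([50 * 1024 * 1024] * 1000, total_parallelism=14 * 8))

With roughly 1000 small files, the bound on concurrent reads is the total parallelism rather than the file count, which may be why a 3000-connection pool works most of the time and fails only when not-yet-released connections pile up.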

Apache Flink error checkpointing to S3

We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point the flow crashed and was not able to recover automatically; however, we were able to restart it manually, without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.
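For illustration, the idea behind "more intelligent partitioning" in plain Python (this is not the Flink API; subtask_for and the key are hypothetical): route each record to a subtask by a stable key hash, so records that belong together cross the network once instead of being re-shuffled among all nodes.

import zlib

def subtask_for(key: bytes, parallelism: int) -> int:
    # zlib.crc32 is deterministic across processes, unlike Python's built-in hash()
    return zlib.crc32(key) % parallelism

# All records for the same key land on the same subtask:
print(subtask_for(b"user-42", 14))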

Redshift Spectrum / The bucket you are attempting to access must be addressed using the specified endpoint

I created a Parquet file in S3 and an external table pointing to it in Redshift Spectrum. Both my S3 bucket and my Redshift cluster are in us-west-2, and I specified the REGION option when creating the external schema.
Queries run smoothly in Athena.
Yet when I run the query from a Redshift client, I get this error:
Amazon Invalid operation: S3 Query Exception (Fetch)
Details:
error: S3 Query Exception (Fetch)
code: 15001
context: Task failed due to an internal error.
HTTP response error code: 301 Message: PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
x-amz-request-id: XXXX
query: XXXXX
location: dory_util.cpp:689
process: query0_40 [pid=XXX]
-----------------------------------------------;
AWS has acknowledged the issue and released a patch overnight.
Please make sure that your Redshift cluster is running at least version 1.0.14016 in us-east-2 or us-west-2, and 1.0.1407 in us-east-1. To apply the patch immediately, move your cluster's maintenance window closer to the current day and time so it is picked up at your convenience.
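To check where a cluster stands and to pull the window forward, a sketch with boto3; the cluster identifier, region, and window are placeholders:

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # placeholder region

cluster = redshift.describe_clusters(ClusterIdentifier="my-cluster")["Clusters"][0]
print(cluster["ClusterVersion"], cluster["ClusterRevisionNumber"])  # e.g. "1.0" and "14016"

# Move the 30-minute maintenance window close to the current time (UTC) so the
# patch is applied at the next window:
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",  # placeholder identifier
    PreferredMaintenanceWindow="mon:09:00-mon:09:30",  # placeholder window
)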