Camel AWS-S3 - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection

I am using camel-aws to poll a remote S3 bucket to check whether a file has arrived or not.
I am not interested in the content of the file.
from("direct:my-route").
.from("aws-s3://my.bucket?useIAMCredentials=true&useAwsKMS=true&awsKMSKeyId=my-key-id&deleteAfterRead=false&operation=listObjects&includeBody=false&prefix=test1/etmp_xi_inbound.xml")
.log(" File detected: ${header.CamelAwsS3Key}")
.end();
I have set includeBody to false so that the content of the file is not read, but I am still getting the warning below:
WARN c.a.s.s.i.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

Do you have autoCloseBody set to true? It seems that newer versions of Camel close the S3 connection automatically, so having autoCloseBody=true means you are trying to close an already-closed connection, which causes this warning.
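If it is set, try removing it or turning it off on the endpoint. A minimal sketch of the consumer with autoCloseBody=false added to the original endpoint URI (the direct: input is omitted for brevity, and the exact option name should be checked against the camel-aws-s3 docs for your Camel version):

import org.apache.camel.builder.RouteBuilder;

public class S3PollRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Same options as the original route, plus autoCloseBody=false
        // (spelling as used in the answer above; verify it against your
        // Camel version's camel-aws-s3 documentation).
        from("aws-s3://my.bucket?useIAMCredentials=true&useAwsKMS=true&awsKMSKeyId=my-key-id"
                + "&deleteAfterRead=false&operation=listObjects&includeBody=false"
                + "&autoCloseBody=false&prefix=test1/etmp_xi_inbound.xml")
            .log("File detected: ${header.CamelAwsS3Key}");
    }
}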

Related

Kafka Connect S3 source throws java.io.IOException

Kafka Connect S3 source connector throws the following exception around 20 seconds into reading an S3 bucket:
Caused by: java.io.IOException: Attempted read on closed stream.
at org.apache.http.conn.EofSensorInputStream.isReadAllowed(EofSensorInputStream.java:107)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:133)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
The error is preceded by the following warning:
WARN Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use. (com.amazonaws.services.s3.internal.S3AbortableInputStream:178)
I am running Kafka Connect from this image: confluentinc/cp-kafka-connect-base:6.2.0, using the confluentinc-kafka-connect-s3-source-2.1.1 jar.
My source connector configuration looks like so:
{
  "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
  "tasks.max": "1",
  "s3.region": "eu-central-1",
  "s3.bucket.name": "test-bucket-yordan",
  "topics.dir": "test-bucket/topics",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
  "schema.compatibility": "NONE",
  "confluent.topic.bootstrap.servers": "blockchain-kafka-kafka-0.blockchain-kafka-kafka-headless.default.svc.cluster.local:9092",
  "transforms": "AddPrefix",
  "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.AddPrefix.regex": ".*",
  "transforms.AddPrefix.replacement": "$0_copy"
}
Any ideas on what might be the issue? Also, I was unable to find the repository of the Kafka Connect S3 source connector; is it open source?
Edit: I don't see the problem if gzip compression on the kafka-connect sink is disabled.
The warning means that close() was called before the file was fully read: S3 was not done sending the data, but the connection was left hanging.
There are two options (see the sketch after this list):
Drain the input stream so it contains no more data; that way the connection can be reused.
Call s3ObjectInputStream.abort(). Note that an aborted connection cannot be reused, so a new one will have to be created, which has a performance cost. In some cases this still makes sense, e.g. when the read is getting too slow.
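A minimal sketch of both options against the plain AWS SDK for Java v1; the bucket and key names are made up for illustration:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

public class S3StreamHandling {
    public static void main(String[] args) throws Exception {
        // Assumes credentials and region come from the default provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        S3Object object = s3.getObject("my-bucket", "my-key"); // hypothetical bucket/key

        try (S3ObjectInputStream in = object.getObjectContent()) {
            byte[] buffer = new byte[8192];
            int firstChunk = in.read(buffer); // read only the bytes you actually need
            System.out.println("Read " + firstChunk + " bytes");

            // Option 1: drain the remainder so the pooled connection can be reused.
            while (in.read(buffer) != -1) {
                // discard
            }

            // Option 2 (instead of draining): abort the stream. The underlying
            // connection is dropped and a new one must be created later, but no
            // time is spent downloading data you do not need.
            // in.abort();
        }
    }
}

Draining is the cheaper choice when little data remains; abort() wins when most of the object would otherwise have to be downloaded just to keep the connection reusable.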

Mule 3.9 logs HTTP response sending task failed with error: Locally closed

I use ApiKit to receive queries. Occasionally I get the following line in a log file:
WARN org.mule.module.http.internal.listener.grizzly.ResponseCompletionHandler - HTTP response sending task failed with error: Locally closed
It seems that in this case the integration has not sent a response to the party that called it. I thought there might be some kind of timeout before ApiKit closes the connection to the caller, but based on the timestamps that does not seem to be the case, as everything happens within a second.
In this case the payload is sent to an Artemis queue before the warning appears, and despite the warning the message is read from Artemis normally; apart from this warning and the missing response, the whole flow works fine.
So, am I correct in thinking that this warning indicates why the response is not sent? And what can be done to prevent this situation?

Is it possible to upload large files to a Ktor & Netty server?

I was making a simple file upload & download service and found out that, as far as I understand, Netty doesn't release direct buffers until request processing is over. As a result, I can't upload larger files.
I was trying to make sure that the problem is not inside my code, so I created the most simple tiny Ktor application:
routing {
    post("upload") {
        call.receiveMultipart().forEachPart {}
        call.respond(HttpStatusCode.OK)
    }
}
The default direct memory size is about 3 GB; to make the test simpler I limit it with:
System.setProperty("io.netty.maxDirectMemory", (10 * 1024 * 1024).toString())
before starting the NettyApplicationEngine.
Now if I upload a large file, for example with httpie, I get "Connection reset":
http -v --form POST http://localhost:42195/upload file@/tmp/FileStorageLoadTest-test-data1.tmp
http: error: ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) while doing POST request to URL: http://localhost:42195/upload
On the server side there is no information about the problem except for the "java.io.IOException: Broken delimiter occurred" exception. But if I put a breakpoint in NettyResponsePipeline#processCallFailed, the real exception is:
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 10420231, max: 10485760)
It is a pity that this exception is not logged.
Also, I found out that the same code works without problems if I use Jetty engine instead.
Environment:
Ubuntu Linux
Java 8
Ktor=1.2.5
netty-transport-native-epoll=4.1.43.Final
(but the problem is the same if Netty is started without native-epoll support)

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app which uploads some files to an S3 bucket and, at a later point, reads those files from the S3 bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files from the S3 bucket.
Uploading files to the S3 bucket works fine, but when the second phase of my app starts, reading those uploaded files from S3, it throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to control this error by increasing the max connections parameter for the s3a API.
As of now, I have around 1000 files in the S3 bucket which are pushed and pulled by my app, and my max connections setting is 3000. I'm using Flink's parallelism to upload/download these files from the S3 bucket. My task manager count is 14.
This is an intermittent failure; I also have successful runs in this scenario.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, then my app should throw this error on every run.
Is there any way to calculate the optimal number of max connections required for my app to work without hitting the connection pool timeout? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files and each file's size. Each split is read separately, so the theoretical maximum number of simultaneous connections isn't based on the number of files alone, but on a combination of file count and file sizes.
The connection pool used by the HTTP client only releases connections after some amount of time, because being able to reuse an existing connection is a win (the server/client handshake doesn't have to happen again). That introduces a degree of randomness into how many connections are available in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using the AWS connection code, the setting to bump is fs.s3.maxConnections, which is not the same property as the pure Hadoop configuration (see the sketch below).
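A minimal sketch of the two differently named properties using the plain Hadoop Configuration API; which key actually takes effect depends on the S3 filesystem implementation your Flink build ships with, and whether you set it programmatically, in core-site.xml, or in flink-conf.yaml depends on your deployment:

import org.apache.hadoop.conf.Configuration;

public class S3ConnectionPoolConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Plain Hadoop s3a property name.
        conf.setInt("fs.s3a.connection.maximum", 4096);
        // EMRFS/AWS-connector variant mentioned above (different key, same idea).
        conf.setInt("fs.s3.maxConnections", 4096);
        System.out.println("s3a pool size: " + conf.get("fs.s3a.connection.maximum"));
    }
}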

Amazon S3 File Read Timeout. Trying to download a file using Java

New to Amazon S3 usage. I get the following error when trying to access a file from Amazon S3 using a simple Java method.
2016-08-23 09:46:48 INFO request:450 - Received successful response:200, AWS Request ID: F5EA01DB74D0D0F5
Caught an AmazonClientException, which means the client encountered an
internal error while trying to communicate with S3, such as not being
able to access the network.
Error Message: Unable to store object contents to disk: Read timed out
The exact same lines of code worked yesterday; I was able to download 100% of a 5 GB file in 12 minutes. Today I'm in a better-connected environment, but only 2% or 3% of the file is downloaded and then the program fails.
Code that I'm using to download:
s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);
You need to set the connection timeout and the socket timeout in your client configuration.
Click here for a reference article
Here is an excerpt from the article:
Several HTTP transport options can be configured through the com.amazonaws.ClientConfiguration object. Default values will suffice for the majority of users, but users who want more control can configure:
Socket timeout
Connection timeout
Maximum retry attempts for retry-able errors
Maximum open HTTP connections
Here is an example of how to do it:
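A minimal sketch along these lines with the AWS SDK for Java v1; the region, timeout, and retry values below are placeholders to tune for your environment, not recommendations from the article:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class TimeoutConfiguredClient {
    public static void main(String[] args) {
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(10_000) // ms to establish the connection
                .withSocketTimeout(60_000)     // ms of socket inactivity before "Read timed out"
                .withMaxErrorRetry(5)          // retry attempts for retryable errors
                .withMaxConnections(50);       // max open HTTP connections

        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .withRegion("us-east-1")       // placeholder region
                .build();

        System.out.println("Client ready: " + s3Client);
    }
}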
Downloading files >3Gb from S3 fails with "SocketTimeoutException: Read timed out"