Amazon S3 File Read Timeout. Trying to download a file using Java

I'm new to Amazon S3. I get the following error when trying to access a file from Amazon S3 using a simple Java method.
2016-08-23 09:46:48 INFO request:450 - Received successful response:200, AWS Request ID: F5EA01DB74D0D0F5
Caught an AmazonClientException, which means the client encountered an
internal error while trying to communicate with S3, such as not being
able to access the network.
Error Message: Unable to store object contents to disk: Read timed out
The exact lines of code worked yesterday. I was able to download 100% of a 5 GB file in 12 minutes. Today I'm in a better-connected environment, but only 2% or 3% of the file is downloaded before the program fails.
The code I'm using to download:
s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);

You need to set the connection timeout and the socket timeout in your client configuration.
Here is an excerpt from the reference article on client configuration:
Several HTTP transport options can be configured through the com.amazonaws.ClientConfiguration object. Default values will suffice for the majority of users, but users who want more control can configure:
Socket timeout
Connection timeout
Maximum retry attempts for retry-able errors
Maximum open HTTP connections
Here is an example of how to do it, from a related question:
Downloading files >3Gb from S3 fails with "SocketTimeoutException: Read timed out"
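Along those lines, a minimal sketch of setting those timeouts on the AWS SDK for Java v1 client (the specific timeout, retry, and pool values below are illustrative assumptions, not recommendations):
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

ClientConfiguration clientConfig = new ClientConfiguration()
        .withConnectionTimeout(60_000)   // 60 s to establish the TCP connection
        .withSocketTimeout(300_000)      // 5 min of socket silence before "Read timed out"
        .withMaxErrorRetry(5)            // retry attempts for retry-able errors
        .withMaxConnections(100);        // maximum open HTTP connections

AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withClientConfiguration(clientConfig)
        .build();

// The original download call then stays the same:
// s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);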

Related

Kafka Connect S3 source throws java.io.IOException

Kafka Connect S3 source connector throws the following exception around 20 seconds into reading an S3 bucket:
Caused by: java.io.IOException: Attempted read on closed stream.
at org.apache.http.conn.EofSensorInputStream.isReadAllowed(EofSensorInputStream.java:107)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:133)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
The error is preceded by the following warning:
WARN Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use. (com.amazonaws.services.s3.internal.S3AbortableInputStream:178)
I am running Kafka Connect from the confluentinc/cp-kafka-connect-base:6.2.0 image, using the confluentinc-kafka-connect-s3-source-2.1.1 jar.
My source connector configuration looks like so:
{
  "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
  "tasks.max": "1",
  "s3.region": "eu-central-1",
  "s3.bucket.name": "test-bucket-yordan",
  "topics.dir": "test-bucket/topics",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
  "schema.compatibility": "NONE",
  "confluent.topic.bootstrap.servers": "blockchain-kafka-kafka-0.blockchain-kafka-kafka-headless.default.svc.cluster.local:9092",
  "transforms": "AddPrefix",
  "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.AddPrefix.regex": ".*",
  "transforms.AddPrefix.replacement": "$0_copy"
}
Any ideas on what might be the issue? Also, I was unable to find the repository for the Kafka Connect S3 source connector; is it open source?
Edit: I don't see the problem if gzip compression on the kafka-connect sink is disabled.
The warning means that close() was called before the object was fully read. S3 was not done sending the data, but the connection was left hanging.
There are two options (a sketch of both follows below):
Validate that the input stream contains no more data, so the connection can be reused.
Call s3ObjectInputStream.abort(). Note that the connection cannot be reused after an abort and a new one will have to be created, which has a performance impact. In some cases this still makes sense, e.g. when the read is getting too slow.
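As a rough sketch of both options against the Java SDK v1 stream (the bucket and key names, and the s3Client variable, are placeholders for whatever your code already has):
import java.io.IOException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

static void readAndRelease(AmazonS3 s3Client) throws IOException {
    try (S3Object object = s3Client.getObject("my-bucket", "my-key")) {
        S3ObjectInputStream content = object.getObjectContent();

        // ... read as much of the object as you actually need ...

        // Option 1: drain the rest so the underlying HTTP connection can be reused.
        byte[] buffer = new byte[8 * 1024];
        while (content.read(buffer) != -1) {
            // discard the remaining bytes
        }

        // Option 2 (instead of draining): abort the stream. The connection is
        // dropped rather than returned to the pool, so a new one has to be
        // opened for the next request.
        // content.abort();
    }
}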

Camel AWS-S3 - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection

I am using camel-aws to poll a remote S3 bucket to check whether a file has arrived or not.
I am not interested in the content of the file.
from("direct:my-route").
.from("aws-s3://my.bucket?useIAMCredentials=true&useAwsKMS=true&awsKMSKeyId=my-key-id&deleteAfterRead=false&operation=listObjects&includeBody=false&prefix=test1/etmp_xi_inbound.xml")
.log(" File detected: ${header.CamelAwsS3Key}")
.end();
I have set includeBody to false so that the content of the file is not read; however, I am getting the warning below:
WARN c.a.s.s.i.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Do you have autoCloseBody set to true? It seems that newer versions of Camel auto-close the S3 connection, so having autoCloseBody=true means you are trying to close an already closed connection, which causes the warning.
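If that turns out to be the cause, a minimal sketch of the consumer with the body excluded and automatic closing turned off might look like this (the KMS options are omitted for brevity, and the exact spelling of the autocloseBody option should be checked against the aws-s3 component docs for your Camel version):
import org.apache.camel.builder.RouteBuilder;

public class S3PollRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Poll for the object key only: skip the body and do not let Camel
        // try to close a body stream that was never fully read.
        from("aws-s3://my.bucket?useIAMCredentials=true"
                + "&deleteAfterRead=false"
                + "&operation=listObjects"
                + "&includeBody=false"
                + "&autocloseBody=false"
                + "&prefix=test1/etmp_xi_inbound.xml")
            .log(" File detected: ${header.CamelAwsS3Key}")
            .end();
    }
}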

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app which uploads some files to an S3 bucket and, at a later point, reads those files back from the S3 bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files in the S3 bucket.
Uploading files to the S3 bucket works fine, but when the second phase of my app starts and reads those uploaded files from S3, it throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to work around this error by increasing the max connections parameter for the s3a API.
As of now, I have around 1000 files in the S3 bucket that are pushed and pulled by my app, and my max connections setting is 3000. I'm using Flink's parallelism to upload/download these files, and my task manager count is 14.
This is an intermittent failure; the same scenario also succeeds at times.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, my app should throw this error on every run.
Is there any way to calculate the optimal number of max connections required for my app to avoid the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files, and each file's size. Each split is read separately, so the theoretical max # of simultaneous connections isn't based on the # of files, but a combination of files and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When you are using AWS's own connector code, the setting to bump is fs.s3.maxConnections, which is not the same as the pure Hadoop (s3a) configuration; see the sketch below.
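For the s3a filesystem the equivalent knob is fs.s3a.connection.maximum. A minimal sketch of bumping it via a Hadoop Configuration, just to illustrate the key names; in a real Flink deployment the same key would normally go into the Hadoop configuration the cluster loads (e.g. core-site.xml), and the value 4096 is simply the one mentioned above:
import org.apache.hadoop.conf.Configuration;

Configuration hadoopConf = new Configuration();
// Connection pool size for the Hadoop s3a connector.
hadoopConf.setInt("fs.s3a.connection.maximum", 4096);
// The AWS (EMRFS) connector uses a different key for the same idea:
// hadoopConf.setInt("fs.s3.maxConnections", 4096);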

Amazon PutObjectRequest for large files throws error

I am getting an error while uploading large files (more than 50 MB) using PutObjectRequest. It throws the error "Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host."
I am using a federated user for this PutObjectRequest.
Please help me solve this issue.
I am sending multiple files in parallel using tasks, as follows:
Task.Factory.StartNew(() =>
{
    PutObjectRequest req = new PutObjectRequest()
    {
        BucketName = _bucketName,
        Key = fileKey,
        FilePath = demoPath
    };
    PutObjectResponse resp = client.PutObject(req);
});
This was a bug in the AWS SDK version I was using (2.3.20).
Now I am using AWS SDK version 2.3.40 and it is working fine. Basically, the error was due to a time difference between the client machine and the server, which was fixed in the updated AWS SDK DLL.

Apache upload failed when file size is over 100k

Below is some information about my problem.
Our Apache 2.2 runs on Windows Server 2008.
Basically, the problem is that users fail to upload files bigger than 100 KB to our server.
The error in Apache log is: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : Error reading request entity data, referer: ......
A few times (not always) I could upload larger files (100 KB to 800 KB; a 20 MB file still failed) in Chrome. In Firefox 4 uploading a file over 100 KB always fails, and IE8 behaves similarly to Firefox 4.
It seems that the server fails to get the request data from the client. I reset TimeOut in the Apache settings to the default value (300), which did not help at all.
I do not have the LimitRequestBody directive set and I am not using PHP. Has anyone seen a similar error before? I am not sure what to try next. Any advice would be appreciated!
Edit: I just tried uploading files via remote desktop on the server itself, and it worked fine. My first thought was the firewall, but that is off all the time; an HTTP proxy is applied, though.