Why do we need setReadLimit(int) in the AWS S3 Java client? - amazon-s3

I am working with the AWS S3 Java library.
This is my code, which uploads a file to S3 using the AWS high-level API (TransferManager):
ClientConfiguration configuration = new ClientConfiguration();
configuration.setUseGzip(true);
configuration.setConnectionTTL(1000 * 60 * 60);
AmazonS3Client amazonS3Client = new AmazonS3Client(configuration);
TransferManager transferManager = new TransferManager(amazonS3Client);
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(message.getBodyLength());
objectMetadata.setContentType("image/jpg");
transferManager.getConfiguration().setMultipartUploadThreshold(1024 * 10);
PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
request.getRequestClientOptions().setReadLimit(1024 * 10);
request.setSdkClientExecutionTimeout(1000 * 60 * 60);
Upload upload = transferManager.upload(request);
upload.waitForCompletion();
I am trying to upload a large file. It usually works, but sometimes I get the error below. I have set readLimit to (1024 * 10).
2019-04-05 06:41:05,679 ERROR [com.demo.AwsS3TransferThread] (Aws-S3-upload) Error in saving File[media/image/osc/54/54ec3f2f-a938-473c-94b7-a55f39aac4a6.png] on S3[demo-test]: com.amazonaws.ResetException: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.resetRequestInputStream(AmazonHttpClient.java:1221)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1042)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:661)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4041)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:3041)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:3026)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadPartsInSeries(UploadCallable.java:255)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:189)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
What is the purpose of readLimit?
How is it useful?
What should I do to avoid this kind of exception?

After researching this for a week,
I have found that if the file you are uploading is smaller than 48 GB, you can set the readLimit value to 5.01 MB.
This is because AWS splits the file into multiple parts, and each part is 5 MB (if you have not changed the minimum part size value); as per the AWS specs, only the last part may be smaller than 5 MB. So I set readLimit to just over 5 MB and it solved the issue.
Purpose of the InputStream readLimit (from the InputStream.mark(int) documentation):
Marks the current position in this input stream. A subsequent call to the reset method repositions this stream at the last marked position so that subsequent reads re-read the same bytes. The readlimit argument tells this input stream to allow that many bytes to be read before the mark position gets invalidated. The general contract of mark is that, if the method markSupported returns true, the stream somehow remembers all the bytes read after the call to mark and stands ready to supply those same bytes again if and whenever the method reset is called. However, the stream is not required to remember any data at all if more than readlimit bytes are read from the stream before reset is called.
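A minimal sketch of wiring readLimit to the part size, adapting the code from the question (inputStream and message come from the question's snippet; treating one extra byte as headroom is an assumption, not a documented requirement):

    // Keep the part size and the mark/reset read limit in sync, so the SDK can
    // reset the stream and retry an entire failed part upload.
    final int partSize = 5 * 1024 * 1024; // 5 MB, the default minimum part size

    TransferManager transferManager = new TransferManager(new AmazonS3Client(new ClientConfiguration()));
    transferManager.getConfiguration().setMinimumUploadPartSize(partSize);
    transferManager.getConfiguration().setMultipartUploadThreshold(partSize);

    ObjectMetadata objectMetadata = new ObjectMetadata();
    objectMetadata.setContentLength(message.getBodyLength());

    PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
    // Allow the SDK to buffer (and therefore re-read) one whole part plus a byte of headroom.
    request.getRequestClientOptions().setReadLimit(partSize + 1);

    Upload upload = transferManager.upload(request);
    upload.waitForCompletion();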

Related

Is there a way to control the number of bytes read in Reactor Netty's TcpClient?

I am using TcpClient to connect to a simple TCP echo server. Messages consist of the message size in 4 bytes followed by the message itself. For instance, to send the message "hello", the server will expect "0005hello", and respond with "0005hello".
When testing under load (approximately 300+ concurrent users), adjacent requests sometimes result in responses "piling up", e.g. sending "0004good" followed by "0003day" might result in the client receiving "0004good0003" followed by "day".
In a conventional, non-WebFlux-based TCP client, one would normally read the first 4 bytes from the socket into a buffer, determine the length of the message N, then read the following N bytes from the socket into a buffer, before returning the response. Is it possible to achieve such fine-grained control, perhaps by using TcpClient's underlying Channel?
I have also considered the approach of accumulating responses in some data structure (Queue, StringBuffer, etc.) and having a daemon parse the result, but this has not had the desired performance in practice.
I solved this by adding a handler of type LengthFieldBasedFrameDecoder to the Connection:
TcpClient.create()
    .host(ADDRESS)
    .port(PORT)
    .doOnConnected((connection) -> {
        connection.addHandler("parseLengthFromFirstFourBytes", new LengthFieldBasedFrameDecoder(9999, 0, 4) {
            @Override
            protected long getUnadjustedFrameLength(ByteBuf buf, int offset, int length, ByteOrder order) {
                ByteBuf lengthBuffer = buf.copy(0, 4);
                byte[] messageLengthBytes = new byte[4];
                lengthBuffer.readBytes(messageLengthBytes);
                String messageLengthString = new String(messageLengthBytes);
                return Long.parseLong(messageLengthString);
            }
        });
    })
    .connect()
    .subscribe();
This solves the issue with the caveat that responses still "pile up" (as described in the question) when the application is subjected to sufficient load.
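For comparison, a minimal sketch of the conventional blocking approach described in the question, using plain java.io (the class and method names are illustrative, not from the original code):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public final class BlockingFrameReader {
        // Reads one length-prefixed frame, e.g. "0005hello" -> "hello".
        public static String readFrame(Socket socket) throws IOException {
            DataInputStream in = new DataInputStream(socket.getInputStream());
            byte[] lengthField = new byte[4];
            in.readFully(lengthField); // first 4 bytes: ASCII-encoded message length
            int length = Integer.parseInt(new String(lengthField, StandardCharsets.US_ASCII));
            byte[] body = new byte[length];
            in.readFully(body); // read exactly N payload bytes, no more
            return new String(body, StandardCharsets.US_ASCII);
        }
    }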

Node AWS.S3 SDK upload timeout

The Node AWS SDK S3.upload method is not completing multipart uploads for some reason.
A readable stream that receives uploads from a browser is set as the Body (the readable stream can be piped to a file writable stream without any problems).
S3.upload is given the following options object:
{
  partSize: 1024 * 1024 * 5,
  queueSize: 1
}
When trying to upload a ~8.5mb file, the file is completely sent from the browser, but the request returned from S3.upload continually fires 'httpUploadProgress' events that indicate that all bytes have been uploaded. The following is received continually until the error occurs:
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
RequestTimeout: Your socket connection to the server was not read from
or written to within the timeout period. Idle connections will be
closed.
The progress loaded field shows that it has loaded the total bytes, but the upload is never completed. Even the end event on the readable stream fires.
Console logging in the SDK itself shows that S3.upload consumes all the available data from the readable stream even if the part size is set to 5mb and the queue size is set to 1.
Do the part size and queue size have an impact on the proper usage of S3.upload? How can this problem be investigated further?
I had to use createMultipartUpload and uploadPart for my larger (8 MB) file upload.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#uploadPart-property
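That answer refers to the JavaScript SDK (createMultipartUpload / uploadPart). For reference alongside the Java examples on this page, a roughly equivalent low-level multipart flow in the Java SDK might look like the sketch below (s3Client, bucket, key, and file are placeholders):

    List<PartETag> partETags = new ArrayList<>();
    InitiateMultipartUploadResult init =
            s3Client.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));

    long partSize = 5 * 1024 * 1024; // 5 MB parts
    long filePosition = 0;
    for (int partNumber = 1; filePosition < file.length(); partNumber++) {
        long currentPartSize = Math.min(partSize, file.length() - filePosition);
        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName(bucket)
                .withKey(key)
                .withUploadId(init.getUploadId())
                .withPartNumber(partNumber)
                .withFile(file)
                .withFileOffset(filePosition)
                .withPartSize(currentPartSize);
        // Collect each part's ETag; S3 needs the full list to assemble the object.
        partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
        filePosition += currentPartSize;
    }

    s3Client.completeMultipartUpload(
            new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), partETags));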

AWS S3 file upload - 1 GB file

I am trying to upload large files (less than 5 GB, hence a normal upload rather than a multipart upload) using the Java SDK. Smaller files get uploaded in no time, but files above 1 MB don't upload; my code gets stuck on the line where the actual upload happens. I tried using the transfer manager (TransferManager.upload) function: when I check the number of bytes transferred, it keeps transferring more than 1 MB and keeps running until I force-stop my Java application. What could be the reason, and where am I going wrong? The same code works for smaller files; the issue is only with larger files.
DefaultAWSCredentialsProviderChain credentialProviderChain = new DefaultAWSCredentialsProviderChain();
TransferManager tx = new TransferManager(credentialProviderChain.getCredentials());
Upload myUpload = tx.upload(S3bucket, fileKey, file);
while (myUpload.isDone() == false) {
    System.out.println("Transfer: " + myUpload.getDescription());
    System.out.println("  - State: " + myUpload.getState());
    System.out.println("  - Progress: " + myUpload.getProgress().getBytesTransferred());
}
s3Client.putObject(new PutObjectRequest(S3bucket, fileKey, file));
I tried both the TransferManager upload and the putObject methods. Same issue with both.
TIA.
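For what it's worth, a minimal sketch of blocking on the transfer instead of polling, mirroring the waitForCompletion() usage from the first question on this page (illustrative only, not a confirmed fix for the hang described above):

    TransferManager tx = new TransferManager(new DefaultAWSCredentialsProviderChain().getCredentials());
    try {
        Upload myUpload = tx.upload(S3bucket, fileKey, file);
        // Block the calling thread until the transfer succeeds or fails,
        // instead of spinning in a polling loop.
        myUpload.waitForCompletion();
    } catch (AmazonClientException | InterruptedException e) {
        e.printStackTrace();
    } finally {
        tx.shutdownNow();
    }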

blob.UploadText to the same blob from multiple threads

A beginner question but I could not find a definite answer anywhere else.
What will happen if I use blob.UploadText() to update the same blob at the same time from two different threads? Will the later call just fail, or will it just overwrite?
Also, what happens if I read from the blob while UploadText() is in progress? Will the read return a previous version, or will the read fail? If the read fails, should I retry?
In my scenario, I have multiple instances of the worker role that may update the same blob (of size 10 MB) from time to time. I do not really care about who wins the write as long as they do not corrupt the blob and reading the blob is not blocked for a long time.
Update: I wrote a sample program to upload a 5 MB file to the same blob reference from 3 different threads. What I found out is this:
1. If the threads reuse the same blobReference object, the second and third threads get an MD5-mismatch exception, but the first thread can still finish the upload successfully.
2. If each thread creates its own BlobClient/BlobReference object, all of the uploads succeed and the actual blob content matches whichever upload finished last.
However, after I increased the size of the upload to a 10 MB file, I am getting a StorageClientException: The specified block list is invalid. What is worse is that the blob content is somehow 0 bytes after this exception, so my blob upload failed and wiped out the blob content. That seems to be a pretty serious problem. Any ideas?
Below is the sample code:
Task[] tasks = new Task[6];
tasks[0] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 0));
tasks[1] = Task.Factory.StartNew(() => ReadBlob(BlobClient,0));
tasks[2] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 1));
tasks[3] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 1));
tasks[4] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 2));
tasks[5] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 2));
Task.WaitAll(tasks);
DoUpload uses the following to upload:
blob.UploadFromStream(streams[i]);
Thanks.

Twisted big file transfer

I am writing a client-server application like this:
client(c#) <-> server (twisted; ftp proxy and additional functional) <-> ftp server
The server has two classes: my own protocol class inherited from the LineReceiver protocol, and FTPClient from twisted.protocols.ftp.
But when the client sends or receives big files (10 Gb - 20 Gb), the server hits a MemoryError. I don't use any buffers in my code. It happens because, after a call to transport.write(data), the data is appended to the reactor's internal write buffer (correct me if I'm wrong).
What should I use to avoid this problem? Or should I change my approach to the problem?
I found out that for big streams I should use the IConsumer and IProducer interfaces. But in the end it will still invoke the transport.write method, and the effect will be the same. Or am I wrong?
UPD:
Here is the logic of the file download/upload (from FTP through the Twisted server to a client on Windows):
The client sends some headers to the Twisted server and after that begins sending the file. The Twisted server receives the headers, then (if it needs to) invokes setRawMode(), opens the FTP connection, receives/sends bytes from/to the client, and closes the connections when everything is done. Here is the part of the code that uploads files:
FTPManager class:
    def _ftpCWDSuccees(self, protocol, fileName):
        self._ftpClientAsync.retrieveFile(fileName, FileReceiver(protocol))

class FileReceiver(Protocol):
    def __init__(self, proto):
        self.__proto = proto

    def dataReceived(self, data):
        self.__proto.transport.write(data)

    def connectionLost(self, why=connectionDone):
        self.__proto.connectionLost(why)
Main proxy-server class:
class SSDMProtocol(LineReceiver):
    ...
After the SSDMProtocol object (call it obSSDMProtocol) has parsed the headers, it invokes a method that opens the FTP connection (FTPClient from twisted.protocols.ftp), sets the FTPManager field _ftpClientAsync, and calls _ftpCWDSuccees(self, protocol, fileName) with protocol = obSSDMProtocol; when the file's bytes are received, dataReceived(self, data) of the FileReceiver object is invoked.
And when self.__proto.transport.write(data) is invoked, the data is appended to the internal buffer faster than it is sent back to the client, so memory runs out. Maybe I can stop reading when the buffer reaches a certain size and resume reading after the buffer has all been sent to the client? Or something like that?
If you're passing a 20 gigabyte (gigabit?) string to transport.write, you're going to need at least 20 gigabytes (gigabits?) of memory - probably more like 40 or 60 due to the extra copying necessary when dealing with strings in Python.
Even if you never pass a single string to transport.write that is 20 gigabytes (gigabits?), if you repeatedly call transport.write with short strings at a rate faster than your network can handle, the send buffer will eventually grow too large to fit in memory and you'll encounter a MemoryError.
The solution to both of these problems is the producer/consumer system. The advantage that using IProducer and IConsumer gives you is that you'll never have a 20 gigabyte (gigabit?) string and you'll never fill up a send buffer with too many shorter strings. The network will be throttled so that bytes are not read faster than your application can deal with them and forget about them. Your strings will end up on the order of 16kB - 64kB, which should easily fit in memory.
You just need to adjust your use of FileReceiver to include registration of the incoming connection as a producer for the outgoing connection:
class FileReceiver(Protocol):
    def __init__(self, outgoing):
        self._outgoing = outgoing

    def connectionMade(self):
        self._outgoing.transport.registerProducer(self.transport, streaming=True)

    def dataReceived(self, data):
        self._outgoing.transport.write(data)
Now whenever self._outgoing.transport's send buffer fills up, it will tell self.transport to pause. Once the send buffer empties out, it will tell self.transport to resume. self.transport knows how to undertake these actions at the TCP level so that data coming into your server will also be slowed down.