AWS S3 file upload - 1 gb file - amazon-s3

I am trying to upload large files (less than 5 GB, hence not multipart upload, normal upload) using java sdk. Smaller files gets uploaded in no time. but files which are above 1 MB, doesnt upload. My code gets stuck in the lince where actual upload happens. I tried using transfer manager (TransferManager.upload) function, when I check the number of bytes transferred, it keeps transferring more than 1 MB and keeps running until I force stop my java application. what could be the reason, where am I going wrong. same code works for smaller files. Issue is only with larger files.
DefaultAWSCredentialsProviderChain credentialProviderChain = new DefaultAWSCredentialsProviderChain();
TransferManager tx = new TransferManager(credentialProviderChain.getCredentials());
Upload myUpload = tx.upload(S3bucket,fileKey, file);
while(myUpload.isDone() == false) {
System.out.println("Transfer: " + myUpload.getDescription());
System.out.println(" - State: " + myUpload.getState());
System.out.println(" - Progress: "
+ myUpload.getProgress().getBytesTransferred());
}
s3Client.upload(new PutObjectRequest(S3bucket,fileKey, file));
Tried both transfer manager upload and putobject methods. Same issue with both.
TIA.

Related

Chronicle Queue despite rolling cycle minutely deleting chronicle file after processing keeps file in open list lsof and not releasing memory

I am using chronicle queue version 5.20.123 and open JDK 11 with Linux Ubuntu 20.04, when we recycle current cycle on minute rolling I am listening on StoreFileListener onReleased I am deleting file then also file remains open without releasing memory nor file gets deleted..
Please guide what needs to be done in order to make it work.
Store FileListener Implemented like this:
storeFileListener = new StoreFileListener() {
#Override
public void onReleased(int cycle, File file) {
file.delete();
}
}
Creation of chronicle Queue as follows:
eventStore = SingleChronicleQueueBuilder.binary(GlobalConstants.CURRENT_DIR
+ GlobalConstants.PATH_SEPARATOR + EventBusConstants.EVENT_DIR
+ GlobalConstants.PATH_SEPARATOR + eventType)
.rollCycle(RollCycles.MINUTELY)
.storeFileListener(storeFileListener).build();
tailer = eventStore.createTailer();
appender = eventStore.acquireAppender();
previousCycle = tailer.cycle();
Recycling of previous Cycle when processing completes:
var store = eventStore.storeForCycle(previousCycle,0,false,null);
eventStore.closeStore(store);
Chronicle Queue Deleted Files lsof :
Manually getting hold of store and trying to close it will do nothing but interfere with reference counting - you increase and then decrease number of references.
Chronicle Queue will automatically release resources for given store after all appenders and tailers using that store are done with it. In your case it's unclear what you do with your tailer, but if it already reads from the new file - the old one will be released, and resources associated with it - although this is done in the background and might not happen immediately.
PS file.delete() returns boolean and it's always a good idea to check the return value to see if the delete was successful (in your case it can be seen it was but still it's considered a good practice)

Why we need to setReadLimit(int) in AWS S3 Java client

I am working on AWS Java S3 Library.
This is my code which is uploading the file to s3 using High-Level API of AWS.
ClientConfiguration configuration = new ClientConfiguration();
configuration.setUseGzip(true);
configuration.setConnectionTTL(1000 * 60 * 60);
AmazonS3Client amazonS3Client = new AmazonS3Client(configuration);
TransferManager transferManager = new TransferManager(amazonS3Client);
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(message.getBodyLength());
objectMetadata.setContentType("image/jpg");
transferManager.getConfiguration().setMultipartUploadThreshold(1024 * 10);
PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
request.getRequestClientOptions().setReadLimit(1024 * 10);
request.setSdkClientExecutionTimeout(1000 * 60 * 60);
Upload upload = transferManager.upload(request);
upload.waitForCompletion();
I am trying to upload a large file. It is working properly but sometimes I am getting below error. I have set readLimit as (1024*10).
2019-04-05 06:41:05,679 ERROR [com.demo.AwsS3TransferThread] (Aws-S3-upload) Error in saving File[media/image/osc/54/54ec3f2f-a938-473c-94b7-a55f39aac4a6.png] on S3[demo-test]: com.amazonaws.ResetException: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.resetRequestInputStream(AmazonHttpClient.java:1221)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1042)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:661)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4041)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:3041)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:3026)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadPartsInSeries(UploadCallable.java:255)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:189)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
What is the perpose of readLimit?
How it will usefull?
What should I do to avoid this kind of exception?
After researching on this for 1 week,
I have found that if your uploading file size is less than 48GB then you can set readLimit value 5.01MB.
because AWS dividing file into multiple parts and each part size is value is 5MB(If you have not changed minimum part size value). as per the AWS specs, last part size is less than 5MB. so I have set readLimit 5MB and it solves the issue.
InputStream readLimit purpose:
Marks the current position in this input stream. A subsequent call to the reset method repositions this stream at the last marked position so that subsequent reads re-read the same bytes.Readlimit arguments tells this input stream to allow that many bytes to be read before the mark position gets invalidated. The general contract of mark is that, if the method markSupported returns, the stream somehow remembers all the bytes read after the call to mark and stands ready to supply those same bytes again if and whenever the method reset is called. However, the stream is not required to remember any data at all if more than readLimit bytes are read from the stream before reset is called.

Node AWS.S3 SDK upload timeout

Using the Node AWS SDK S3.upload method is not completing multi part uploads for some reason.
A readable stream that receives uploads from a browser is set as the Body (the readable stream is able to be be piped to file writableStream without any problems).
S3.upload is given the following options object:
{
partSize: 1024*1024*5,
queueSize: 1
}
When trying to upload a ~8.5mb file, the file is completely sent from the browser, but the request returned from S3.upload continually fires 'httpUploadProgress' events that indicate that all bytes have been uploaded. The following is received continually until the error occurs:
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
RequestTimeout: Your socket connection to the server was not read from
or written to within the timeout period. Idle connections will be
closed.
The progress loaded field shows that it has loaded the total bytes, but the upload is never completed. Even the end event on the readable stream fires.
Console logging in the SDK itself shows that S3.upload consumes all the available data from the readable stream even if the part size is set to 5mb and the queue size is set to 1.
Does the part size and queue size have an impact on proper usage of S3.upload? How can this problem be investigated further?
I had to use createMultipartUpload and uploadPart for my larger (8Mb) file upload.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#uploadPart-property

blob.uploadtext to the same blob from multiple threads

A beginner question but I could not find a definite answer anywhere else.
What will happen if I use blob.uploadtext() to update the same blob at the same time from two different threads? Will the latter just fail? Or the latter just overwrite?
Also, what happens if I read from the blob while the uploadtext() is in progress? Will the read return a previous version? Or will the read fail? If the read fails, should I retry?
In my scenario, I have multiple instances of the worker role that may update the same blob (of size 10 MB) from time to time. I do not really care about who wins the write as long as they do not corrupt the blob and reading the blob is not blocked for a long time.
Update: I wrote a sample program to upload a 5MB file to the same blobreference from 3 different threads. What I found out is this:
1. If the thread reuse the blobReference object, then the second and third thread will get the MD5 not match exception. But the first thread still can successfully finish the uploading.
2.If each thread create their own BlobClient/BlobReference object, then all the upload will succeed and the actual blob content matches with the one that finished the last.
However, after I increased the size of the blob upload to a 10MB file, I am getting a StorageClientException: The specified block list is invalid. What is worse is that the blob content is somehow 0 bytes after this exception. So my blob upload failed and wiped out the blob content. That seems to be a pretty serious problem. Any ideas?
At below is the sample code:
Task[] tasks = new Task[6];
tasks[0] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 0));
tasks[1] = Task.Factory.StartNew(() => ReadBlob(BlobClient,0));
tasks[2] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 1));
tasks[3] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 1));
tasks[4] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 2));
tasks[5] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 2));
Task.WaitAll(tasks);
DoUpload uses the following to upload:
blob.UploadFromStream(streams[i]);
Thanks.

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]).
This was only from time to time (1 time on around 100) and was not reproducable.
Digging into the Cascading codem, we discovered the Hfs.deleteResource() is called (among others) by the BaseFlow.deleteSinksIfNotUpdate().
Btw, we were quite intrigued with the silent NPE (with comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]) with a retry mechanism calling directly the getFileSystem(conf).delete.
The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see example in [3]): it sounds like HDFS returns isDeleted=true, but asking directly after if the folder exists, we receive exists=true, which should not happen. Logs also shows randomly isDeleted true or false when the flow succeeds, which sounds like the returned value is irrelevant or not to be trusted.
Can anybody bring his own S3 experience with such a behavior: "folder should be deleted, but it is not"? We suspect a S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
#Override
public boolean deleteResource(JobConf conf) throws IOException {
LOGGER.info("Deleting resource {}", getIdentifier());
boolean isDeleted = super.deleteResource(conf);
LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted,
getIdentifier());
Path path = new Path(getIdentifier());
int retryCount = 0;
int cumulativeSleepTime = 0;
int sleepTime = 1000;
while (getFileSystem(conf).exists(path)) {
LOGGER
.info(
"Resource {} still exists, it should not... - I will continue to wait patiently...",
getIdentifier());
try {
LOGGER.info("Now I will sleep " + sleepTime / 1000
+ " seconds while trying to delete {} - attempt: {}",
getIdentifier(), retryCount + 1);
Thread.sleep(sleepTime);
cumulativeSleepTime += sleepTime;
sleepTime *= 2;
} catch (InterruptedException e) {
e.printStackTrace();
LOGGER
.error(
"Interrupted while sleeping trying to delete {} with message {}...",
getIdentifier(), e.getMessage());
throw new RuntimeException(e);
}
if (retryCount == 0) {
getFileSystem(conf).delete(getPath(), true);
}
retryCount++;
if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
break;
}
}
if (getFileSystem(conf).exists(path)) {
LOGGER
.error(
"We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
getIdentifier());
throw new RuntimeException(
"Although we waited to delete the resource for "
+ getIdentifier()
+ ' '
+ retryCount
+ " iterations, it still exists - This must be an issue in the underlying storage system.");
}
return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note Amazon EMR is listed as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem. HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data against S3. For example, a rename is really a copy and delete. Where the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no directories in S3. It is a hack and supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating to S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 when using EMR vs standard EC2 instances. I also believe their is a patch in EMR that improves matters here as well.
So I'd suggest try running the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps which turned out to have similar race conditions for delete as their root issue -- whether they were in Cascading or Hadoop streaming or written directly in Java.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stood. Otherwise, it's probably best to design systems -- again, whether in Cascading or any other app that uses S3 -- such that data which is consumed or produced by a batch workflow gets managed in HDFS or HBase or a key/value framework (e.g., have used Redis for this). Then S3 gets used for durable storage, but not for intermediate data.