TransferManager vs PutObjectRequest in Amazon S3

I am confused about single-object upload versus multipart upload: both take about the same time. My code is below:
File file = new File("D:\\AmazonS3\\aws-java-sdk-1.8.3\\lib\\aws-java-sdk-1.8.3-javadoc.jar");
FileInputStream fis = new FileInputStream(file);
String keyName = System.currentTimeMillis()+"_aws-java-sdk-1.8.3-javadoc.jar";
ObjectMetadata metaData = new ObjectMetadata();
metaData.addUserMetadata("test","TEST");
//Object single upload
PutObjectRequest putobejcObjectRequest = new PutObjectRequest(BUCKET_NAME, keyName, fis,metaData);
putobejcObjectRequest.setMetadata(metaData);
s3client.putObject(putobejcObjectRequest);
//Object multipart upload
TransferManagerConfiguration configuration = new TransferManagerConfiguration();
configuration.setMultipartUploadThreshold(5*com.amazonaws.services.s3.internal.Constants.MB);
TransferManager transferManager = new TransferManager(s3client);
transferManager.setConfiguration(configuration);
Upload upload = transferManager.upload(BUCKET_NAME, keyName, fis,metaData);
upload.waitForCompletion();
transferManager.shutdownNow();
Please help: is there a problem in my code?

I faced the same problem and found (by inspecting the SDK code) that TransferManager uploads parts in parallel only if you pass it a File, not an InputStream.
Here is the decision code from the SDK (version 1.8.9):
if (TransferManagerUtils.isUploadParallelizable(putObjectRequest, isUsingEncryption)) {
    captureUploadStateIfPossible();
    uploadPartsInParallel(requestFactory, multipartUploadId);
    return null;
} else {
    return uploadPartsInSeries(requestFactory);
}
where isUploadParallelizable is:
// Each uploaded part in an encrypted upload depends on the encryption context
// from the previous upload, so we cannot parallelize encrypted upload parts.
if (isUsingEncryption) return false;
// Otherwise, if there's a file, we can process the uploads concurrently.
return (getRequestFile(putObjectRequest) != null);
So, if you want to take advantage of parallel part uploads, pass a File to TransferManager, for example transferManager.upload(BUCKET_NAME, keyName, file).

When uploading options from a stream, callers must supply the size of
options in the stream through the content length field in the
ObjectMetadata parameter. If no content length is specified for the
input stream, then TransferManager will attempt to buffer all the
stream contents in memory and upload the options as a traditional,
single part upload. Because the entire stream contents must be
buffered in memory, this can be very expensive, and should be avoided
whenever possible.
source: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#upload-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-


Unable to use OrcFile.createReader to read from S3

I am having trouble reading an ORC file from S3 with OrcFile.createReader. I am using hive-exec-2.2.0.jar at the moment and am wondering whether this is supported at all. Am I missing any configuration settings? See the code below. Any help will be appreciated.
String accessKey = "***";
String secretKey = "***";
Configuration configuration = new Configuration();
configuration.set("fs.s3.awsAccessKeyId", accessKey);
configuration.set("fs.s3.awsSecretAccessKey", secretKey);
configuration.set("fs.defaultFS", "s3://<bucket>");
//configuration.set("fs.default.name", "s3://<bucket>");
//configuration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
FileSystem fs = FileSystem.get(configuration);
Reader reader = OrcFile.createReader(new Path("/some/path/file.orc"), OrcFile.readerOptions(configuration).filesystem(fs));
Exception - java.io.IOException: No such file.
The ReaderImpl seems to require either the fileMetadata or the OrcTail (both of which are null). Is there anything I might be missing?
Update: I managed to get past the file-not-found exception by creating the S3 object with additional metadata (and also fixing the key):
--metadata="fs=Hadoop,fs-type=block,fs-version=1"
It seems wrong that the ORC file on S3 must carry those values before its metadata can be retrieved.
Of course, after this it fails when reading the data, probably because the file formats differ(?).
In INode of package org.apache.hadoop.fs.s3:
public static INode deserialize(InputStream in) throws IOException {
if (in == null) {
return null;
} else {
DataInputStream dataIn = new DataInputStream(in);
INode.FileType fileType = FILE_TYPES[dataIn.readByte()];
Here dataIn.readByte() returns a value that is out of range for FILE_TYPES (an array of size 2).

An exception "The Content-MD5 you specified did not match what we received"

I got an exception that I had never seen before while testing my application, which uploads a file from EC2 to S3. The exception is:
Exception in thread "Thread-1" com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest; Request ID: 972CB8E04388AB20), S3 Extended Request ID: T7bmFnQ2RlGWlJD+aGYfTy97XZw88pbQrwNB8YCezSjyq6O2joxHRP/6ko+Q2zZeGewkw4x/90k=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1383)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3676)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1439)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What can I do to fix this bug? I used the same code as before in my application.
I think I have solved my problem. I finally found that some of my files actually changed during the upload. Each file is generated by another thread, so generation and upload ran at the same time: a file that was still being written could already be uploading.
AmazonS3Client computes the MD5 of the file at the beginning of the upload and then uploads the whole file to S3. By that point the file differs from what it was when the MD5 was computed, so the digests no longer match. I changed my program to be single-threaded, and the problem never turned up again.
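To illustrate why a file that is still being written can trigger BadDigest, here is a minimal, SDK-free Java sketch. It assumes the client announces a digest of the bytes it sees at the start, while the server checks the bytes it actually receives. The class and method names (Md5ChangeDemo, md5Base64, digestChangesWhenFileGrows) are hypothetical and not part of the AWS SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class Md5ChangeDemo {
    /** Base64-encoded MD5 of the file's current bytes (the encoding the Content-MD5 header uses). */
    static String md5Base64(Path file) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return Base64.getEncoder().encodeToString(md.digest(Files.readAllBytes(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the digest taken before an append differs from the digest taken after it. */
    static boolean digestChangesWhenFileGrows() {
        try {
            Path file = Files.createTempFile("upload-demo", ".dat");
            try {
                Files.write(file, "first half".getBytes(StandardCharsets.UTF_8));
                String before = md5Base64(file); // digest announced at the start of the upload
                // Simulates the generator thread still writing while the upload runs.
                Files.write(file, " second half".getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.APPEND);
                String after = md5Base64(file);  // digest of the bytes actually sent
                return !before.equals(after);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("digest changed: " + digestChangesWhenFileGrows());
    }
}
```

Going single-threaded is one way out; another option, under the same assumption, is to start the upload only after the generating thread has closed the file.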
Another cause of this issue is running code such as this (Python):
with open(filename, 'rb') as fd:
    self._bucket1.put_object(Key=key, Body=fd)
    self._bucket2.put_object(Key=key, Body=fd)
In this case the file object (fd) already points to the end of the file when the second put_object call runs, so we get the "Content-MD5" error. To avoid it, move the file position back to the start of the file before the second call:
with open(filename, 'rb') as fd:
    self._bucket1.put_object(Key=key, Body=fd)
    fd.seek(0)
    self._bucket2.put_object(Key=key, Body=fd)
This way we won't get the aforementioned Boto error.
I also ran into this error when I was doing something like this:
InputStream productInputStream = convertImageFileToInputStream(file);
InputStream thumbnailInputStream = generateThumbnail(productInputStream);
String uploadedFileUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productFilename, productInputStream);
String uploadedThumbnailUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productThumbnailFilename, thumbnailInputStream);
The generateThumbnail method was manipulating the productInputStream using a third party library. Because I couldn't modify the third party library, I simply performed the upload first:
InputStream productInputStream = convertImageFileToInputStream(file);
// do this first...
String uploadedFileUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productFilename, productInputStream);
// and then this...
InputStream thumbnailInputStream = generateThumbnail(productInputStream);
String uploadedThumbnailUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productThumbnailFilename, thumbnailInputStream);
... and added this line inside my generateThumbnail method:
productInputStream.reset();
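Note that reset() only helps when the stream supports mark/reset (check InputStream.markSupported()); the base InputStream.reset() throws an IOException. Here is a minimal sketch using a ByteArrayInputStream, which does support it. The class and method names (StreamReuseDemo, drain, readThreeTimes) are hypothetical and only illustrate the idea:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamReuseDemo {
    /** Reads the stream to its end and returns how many bytes were consumed. */
    static int drain(InputStream in) {
        try {
            byte[] buf = new byte[4096];
            int total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Byte counts for: the first read, a second read without reset, and a read after reset. */
    static int[] readThreeTimes(byte[] data) {
        try {
            InputStream in = new ByteArrayInputStream(data);
            in.mark(data.length);    // remember the start position
            int first = drain(in);   // e.g. the first consumer of the stream
            int second = drain(in);  // stream is exhausted: nothing left to read
            in.reset();              // rewind to the marked position
            int third = drain(in);   // full content is available again
            return new int[] {first, second, third};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = readThreeTimes(new byte[] {1, 2, 3, 4, 5});
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

The second read sees an empty stream, which is the situation that leads to a digest mismatch when the same stream is handed to two consumers.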
FWIW, I've managed to find a completely different way of triggering this problem, which requires a different solution.
It turns out that if you decide to assign ObjectMetadata to a PutObjectRequest explicitly, for example to specify a cacheControl setting, or a contentType, then the AWS SDK mutates the ObjectMetadata instance to stash the MD5 that it computes for the put request. This means that if you are putting multiple objects, all of which you think should have the same metadata assigned to them, you still need to create a new ObjectMetadata instance for each and every PutObjectRequest. If you don't do this, then it reuses the MD5 computed from the previous put request and you get the MD5 mismatch error on the second object you try to put.
So, to be explicit, doing something like this will fail on the second iteration:
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("text/html");
for (Put obj : thingsToPut)
{
    PutObjectRequest por =
        new PutObjectRequest(bucketName, obj.s3Key, obj.file);
    por = por.withMetadata(metadata);
    PutObjectResult res = s3.putObject(por);
}
You need to do it like this:
for (Put obj : thingsToPut)
{
    ObjectMetadata metadata = new ObjectMetadata(); // <<-- New ObjectMetadata every time!
    metadata.setContentType("text/html");
    PutObjectRequest por =
        new PutObjectRequest(bucketName, obj.s3Key, obj.file);
    por = por.withMetadata(metadata);
    PutObjectResult res = s3.putObject(por);
}
I too ran into this problem. How I solved this:
I have a microservice that processes AWS SQS Messages. Each message would create multiple temporary files that would have to be uploaded to S3.
The issue was that the temporary files were named with fixed names without any salt added to them.
So between two messages, it was possible to rewrite the original file that was to be uploaded.
I fixed it by adding a random salt (this can be a UUID or the current time in millis depending on what you want) to the file names, after which the files were not being over-written and were successfully uploaded to S3.
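A minimal sketch of the salting idea, assuming a UUID prefix is acceptable for your naming scheme. UniqueUploadName and saltedName are hypothetical names, not taken from the service described above:

```java
import java.util.UUID;

class UniqueUploadName {
    /** Prefixes the base name with a random UUID so two messages never share a temp file name. */
    static String saltedName(String baseName) {
        return UUID.randomUUID() + "_" + baseName;
    }

    public static void main(String[] args) {
        // Two calls with the same base name yield two different file names.
        System.out.println(saltedName("report.csv"));
        System.out.println(saltedName("report.csv"));
    }
}
```

A UUID avoids the collisions that a millisecond timestamp can still produce when two messages are processed within the same millisecond.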
For me, the cause was passing ContentLength in the params when calling upload. With it commented out, the upload worked fine.
const params = {
    Bucket: "",
    ContentType: "application/json",
    Key: "filename.json",
    // ContentLength: body.length, <--- what I have commented out
    Body: body
};
await s3.upload(params).promise();

Stream pdfs from url and add it to Zip

I have an MVC 4.5 application that shows a grid. The first column of the grid is a document name, which is a hyperlink to the actual document hosted on our site and available via a URL. The documents can be PDF, DOC, or PPT files. I can access these documents only via URL; I do not have access to the physical files on our server.
I let users select one or more of these documents from the grid and download them. What I am trying to achieve is to read each selected document via its URL, write it to a zip file, and make the zip file downloadable, so that users download one file instead of several.
I have tried to stream the documents from their URLs into memory and then add them to the zip file using Microsoft's ZipArchive library. This is not working for me.
I was able to add documents that were on disk to a zip file using ZipArchive, and that works great. But I cannot reach the physical documents, only their URLs. My next option is to download each document to a temp location on the server and then add it to the zip file with ZipArchive, but I am trying to avoid downloading files to a temp location.
Please suggest how I can read documents via URL into memory, add each of them to a zip file, and make the zip file downloadable.
Any help will be appreciated.
Thank you, Cbroe, for commenting. I figured out the answer. The problem was that I was reading the PDF from the URL, converting it to a memory stream, and then trying to add the memory stream to the ZipArchive, which did not work. Instead, I extracted the byte array from the memory stream and wrote that to the zip archive entry, and it worked.
Here is a code snippet that might be useful for someone. My first contribution to Stack Overflow.
public FileResult DownloadZip()
{
    MemoryStream memoryStream = new MemoryStream();
    // leaveOpen: true keeps memoryStream usable after the archive is disposed
    using (var archive = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
    {
        var demoFile = archive.CreateEntry("Pdf123.pdf");
        var convertedStream = ConvertTobyte("http://www.example.com/Pdf123.pdf");
        using (var entryStream = demoFile.Open())
        {
            entryStream.Write(convertedStream, 0, convertedStream.Length);
        }
        demoFile = archive.CreateEntry("Pdf456.pdf");
        convertedStream = ConvertTobyte("http://www.example.com/Pdf456.pdf");
        using (var entryStream = demoFile.Open())
        {
            entryStream.Write(convertedStream, 0, convertedStream.Length);
        }
    }
    // This option writes the zip to your local disk
    using (var fileStream = new FileStream(@"C:\Temp\test.zip", FileMode.Create))
    {
        memoryStream.Seek(0, SeekOrigin.Begin);
        memoryStream.CopyTo(fileStream);
    }
    // This option downloads the zip via the browser
    memoryStream.Seek(0, SeekOrigin.Begin);
    return new FileStreamResult(memoryStream, "application/zip")
    {
        FileDownloadName = "Archive.zip"
    };
}
private static byte[] ConvertTobyte(string fileUrl)
{
    byte[] imageData = null;
    using (var wc = new System.Net.WebClient())
        imageData = wc.DownloadData(fileUrl);
    return imageData;
}
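For readers on the JVM, the same technique (writing each document's byte array into an in-memory zip) can be sketched in Java with java.util.zip. This is only an analogue under my own assumptions, not a translation of the C# above: InMemoryZip, zip and entryNames are hypothetical names, and the download step is omitted so the block stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class InMemoryZip {
    /** Builds a zip archive in memory; map keys are entry names, values are the document bytes. */
    static byte[] zip(Map<String, byte[]> documents) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> doc : documents.entrySet()) {
                zos.putNextEntry(new ZipEntry(doc.getKey()));
                zos.write(doc.getValue());
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The zip stream is closed here, so the central directory has been written.
        return buffer.toByteArray();
    }

    /** Lists the entry names of a zip archive held in memory, in archive order. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        // Placeholder bytes; in the scenario above they would be downloaded from each URL.
        docs.put("Pdf123.pdf", "first".getBytes(StandardCharsets.UTF_8));
        docs.put("Pdf456.pdf", "second".getBytes(StandardCharsets.UTF_8));
        System.out.println(entryNames(zip(docs)));
    }
}
```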

Winrt StreamWriter & StorageFile does not completely Overwrite File

A quick search here yielded nothing. I have started using some rather roundabout ways to use StreamWriter in my WinRT application. Reading works well, but writing behaves differently. When I choose a new file to write to, there is no problem: the file is created as I expect. When I choose to overwrite an existing file, it is overwritten only up to the point where my stream stops writing; if the original file was larger, its old contents remain past that point.
The code is as such:
public async void WriteFile(StorageFile selectedFileToSave)
{
    // At this point, selectedFileToSave is from the Save File picker, so it can be a new or existing file
    Encoding enc = new UTF8Encoding();
    Stream dotNetStream;
    dotNetStream = await selectedFileToSave.OpenStreamForWriteAsync();
    StreamWriter writeStream = new StreamWriter(dotNetStream, enc);
    // Do writing here
    // Close
    writeStream.Write(Environment.NewLine);
    await writeStream.FlushAsync();
    await dotNetStream.FlushAsync();
}
Can anyone offer clues on what I could be missing? Many functions are missing in WinRT, so I'm not sure how to work around this.
Alternatively, you can set the length of the stream to 0 with the SetLength method before using StreamWriter:
var stream = await file.OpenStreamForWriteAsync();
stream.SetLength(0);
using (var writer = new StreamWriter(stream))
{
    writer.Write(text);
}
Why not just use the helper methods in the FileIO class? You could call:
await FileIO.WriteTextAsync(selectedFileToSave, newTextContents);
If you really need a StreamWriter, first truncate the file by calling
await FileIO.WriteBytesAsync(selectedFileToSave, new byte[0]);
and then continue with your existing code.
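The underlying behaviour is not specific to WinRT: opening an existing file for writing without truncating it leaves the old tail in place. Here is a small Java analogue under my own assumptions (TruncateOnOverwrite and overwrite are hypothetical names), using java.nio.file open options to show both outcomes:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TruncateOnOverwrite {
    /** Writes oldText, then overwrites it with newText, and returns the resulting file content. */
    static String overwrite(String oldText, String newText, boolean truncate) {
        try {
            Path file = Files.createTempFile("overwrite-demo", ".txt");
            try {
                Files.write(file, oldText.getBytes(StandardCharsets.UTF_8));
                byte[] replacement = newText.getBytes(StandardCharsets.UTF_8);
                if (truncate) {
                    // Drop the old contents first, then write the new ones.
                    Files.write(file, replacement,
                            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
                } else {
                    // Writes from offset 0 but keeps whatever lies beyond the new data.
                    Files.write(file, replacement, StandardOpenOption.WRITE);
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(overwrite("old contents that are long", "new", false));
        System.out.println(overwrite("old contents that are long", "new", true));
    }
}
```

Without truncation the result still ends with the tail of the old text, which matches the symptom described in the question.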

context path for file upload without HttpRequest in REST application

I am building a REST application. I want to upload a file and save it in, for example, /WEB-INF/resource/uploads.
How can I get the path to this directory? My controller looks like this:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile file, Model model) {
    try {
        String fileName = null;
        InputStream inputStream = null;
        OutputStream outputStream = null;
        if (file.getSize() > 0) {
            inputStream = file.getInputStream();
            fileName = "D:/" + file.getOriginalFilename();
            outputStream = new FileOutputStream(fileName);
            int readBytes = 0;
            byte[] buffer = new byte[10000];
            while ((readBytes = inputStream.read(buffer, 0, 10000)) != -1) {
                outputStream.write(buffer, 0, readBytes);
            }
            outputStream.close();
            inputStream.close();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    model.addAttribute("step", 3);
    this.houseDao.update(house);
    return "houseAdmin";
}
Second question: what is the best place to store uploaded user files?
/WEB-INF is a bad place to try to store file uploads. There's no guarantee that this is an actual directory on the disk, and even if it is, the appserver may forbid write access to it.
Where you should store your files depends on what you want to do with them, and what operating system you're running on. My advice is to pick somewhere outside of the webapp itself, perhaps a dedicated directory.
Also, the process of transferring the MultipartFile to another location is much simpler than you're making it out to be:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile srcFile, Model model) throws IOException {
    File destFile = new File("/path/to/the/target/file");
    srcFile.transferTo(destFile); // easy!
    model.addAttribute("step", 3);
    this.houseDao.update(house);
    return "houseAdmin";
}
You shouldn't store files in /WEB-INF/resource/uploads. This directory is either inside your WAR (if packaged) or exploded somewhere inside the servlet container. The first location is read-only and the latter should not be used for user files.
There are usually two places considered when storing uploaded files:
Some dedicated folder. Make sure users cannot access this directory directly (e.g. it must not be an anonymous FTP folder). Note that once your application runs on more than one machine, the instances will no longer share this folder, so consider some form of network synchronization or a shared network drive.
Database. This is controversial since binary files tend to occupy a lot of space, but this approach is a bit simpler when distributing your application.
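As a rough sketch of the dedicated-folder option, the following framework-free Java copies an uploaded stream into a directory outside the webapp. UploadStore and save are hypothetical names, the directory used in main is only a placeholder, and the file-name handling (keeping just the last path segment and adding a UUID prefix) is my own assumption rather than something the answers above prescribe:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

class UploadStore {
    /** Copies the upload into uploadDir under a unique name and returns the stored path. */
    static Path save(Path uploadDir, String originalName, InputStream content) {
        try {
            Files.createDirectories(uploadDir);
            // Keep only the last path segment so a name like "../../x" cannot leave uploadDir.
            Path namePart = Paths.get(originalName).getFileName();
            String safeName = (namePart == null) ? "upload" : namePart.toString();
            // The UUID prefix keeps two uploads with the same name from colliding.
            Path target = uploadDir.resolve(UUID.randomUUID() + "_" + safeName);
            Files.copy(content, target);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder location; a real deployment would read this from configuration.
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "uploads-demo");
        InputStream body = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path saved = save(dir, "../../report.pdf", body);
        System.out.println("stored as " + saved.getFileName());
    }
}
```

In a Spring controller the stream would come from MultipartFile.getInputStream(), as in the question's code.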