What are the appraoches to process large S3 file from Lambda? - amazon-s3

In project, there is a need to process a file having a record of 50K. But, as per current implementation by AWS, the response payload size returned by a Lambda function cannot exceed 6 mb.
Can anyone help me to know what are the approaches/implementation I can carried here?
Lambda code:
#Override
public String handleRequest(S3Event event, Context context) {
String bucket = null;
String key = null;
String jsonStringFromCSV = null;
csvToJsonParser = new FileToJsonParserServiceImpl();
context.getLogger().log("Received event: " + event);
try {
if (null != event.getRecords().get(0) && event.getRecords().size() > 0) {
bucket = event.getRecords().get(0).getS3().getBucket().getName();
key = event.getRecords().get(0).getS3().getObject().getKey();
S3Object response = s3.getObject(new GetObjectRequest(bucket, key));
String contentType = response.getObjectMetadata().getContentType();
context.getLogger().log("CONTENT TYPE: " + contentType);
InputStream objectData = response.getObjectContent();
BufferedReader bufferReader = new BufferedReader(new InputStreamReader(objectData, "UTF-8"));
jsonStringFromCSV = csvToJsonParser.getMemberObjectListAsJson(bufferReader);
}
return jsonStringFromCSV;
} catch (Exception e) {
e.printStackTrace();
context.getLogger().log(String.format("Error getting object %s from bucket %s. Make sure they exist and"
+ " your bucket is in the same region as this function.", key, bucket));
throw new RuntimeException(e);
}
}

The response size is not a problem here since you're already using S3, after all, your function is going to executed asynchronously and no other system will be able to grab it's response directly (you can, however, send its response to SNS or SQS and later process it, for example).
What you can do to achieve what you want, is to have your Lambda be triggered by an S3 Event. You would then process this file (it cannot exceed 15 mins, though, so if your file is really huge, then Lambda may not be the best tool here, but 50K records - considering the records are only a couple of (k)bytes long - should be fine) and output it somewhere else (be it DynamoDB, S3, RDS, SNS, SQS, etc).
Lambda's payload size limit is definitely not a constraint for you here.
EDIT: the OP wants to store the String generated out of the CSV file somewhere, so here's one way to achieve it:
You could store that big JSON string in a file in S3 without having to invoke another Lambda to do the job. Long story short, your function would process the CSV file, generate a string out of it and then save it to an S3 file. If you need to store it in a database, you could use DynamoDB, but since DDB has limits on the objects size, you could simply create an attribute in your table called processedFilePath which points to your S3 object.
If you want to decouple things a little bit, your source Lambda could store a file in a S3, called {some_identifier}_processed.txt and then put a message in SQS with this key (or use another S3 event, but let's use SQS to see different approaches). You could then wire up another Lambda to process this SQS message and store it in DynamoDB. Your object in DynamoDB would look like:
{
"id": "2511d862-53c1-48e2-beb4-ecc6c9deaac1",
"bucket": "my-processed-files-bucket",
"processedFileKey": "{some_identifier}_processed.txt"
}
This way you can circumvent payload sizes both on Lambda and DynamoDB and simply load the file based on processedFileKey when necessary

Related

Very slow AWS presignedURL download. How to speed it up?

I am reading 20 - 30 different objects in varying size using the S3 IAM and a unique presignedURL for each file. The download of all files occur at once. Each phase occurs in sequence. Unfortunatly the S3Client is not thread safe so we cannot use async operations. Some files transfer rapidly while others lag. The total operation can take between 7 to > 15 seconds. I expected greater performance from S3 since AWS advertises that it has high throughput.
I see several posts that are unanswered about the download performance from S3. However the problem seems to have increased once we introduced link ambiguation using the IAM and presignedURL.
FYI my internet connection is broadband. It is unlikely the cause of the performance issue.
The tests are performed only a few hundred miles from S3 storage and eliminates distance as a factor of performance issues.
There is no server between the client and the S3 for downloading objects and is not the cause of performance issues.
One caveat. We tried using async forAllChunked using the Rice.edu habenero api. When we did not have any errors due to threading problems, the download performance was still very slow. This seemingly should eleminate the idea that download performance is slow due to it's serialization in the for loop. Albiet performance should be far better if we can download files simultainiously.
Code attatched.
public void cloudGetMedia(ArrayList<MediaSyncObj> mediaObjs, ArrayList<String> signedUrls) {
long getTime = System.currentTimeMillis();
// Ensure media directory exists or create it
String toDiskDir = DirectoryMgr.getMediaPath('M');
File diskFile = new File(toDiskDir);
FileOpsUtil.folderExists(diskFile);
// Process signedURLs
for(String signedurl : signedUrls) {
LOGGER.debug("cloudGetMedia called. signedURL is null: {}", signedurl == null);
URI fileToBeDownloaded = null;
try {
fileToBeDownloaded = new URI(signedurl);
} catch (URISyntaxException e) {
e.printStackTrace();
}
// get the file name from the presignedURL
AmazonS3URI s3URI = new AmazonS3URI(fileToBeDownloaded);
String localURL = toDiskDir + "/" + s3URI.getKey();
File file = new File(localURL);
AmazonS3 client = AmazonS3ClientBuilder.standard()
.withRegion(s3URI.getRegion())
.build();
try{
URL url = new URL(signedurl);
PresignedUrlDownloadRequest req = new PresignedUrlDownloadRequest(url);
client.download(req, file);
}
catch (MalformedURLException e) {
LOGGER.warn(e.getMessage());
e.printStackTrace();
}
}
getTime = (System.currentTimeMillis() - getTime);
LOGGER.debug("Total get time in syncCloudMediaAction: {} milliseconds, numElement: {}", getTime, signedUrls.size());
}

Copy Data From Azure Blob Storage to AWS S3

I am new to Azure Data Factory and have an interesting requirement.
I need to move files from Azure Blob storage to Amazon S3, ideally using Azure Data Factory.
However S3 isnt supported as a sink;
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
I also understand from a variety of comments i've read on here that you cannot directly copy from Blob Storage to S3 - you would need to download the file locally and then upload it to S3.
Does anyone know of any examples, in Data factory, SSIS or Azure Runbook that can do such a thing, I suppose an option would be to write an azure logic-app or function that is called from Data Factory.
Managed to get something working on this - it might be useful for someone else.
I decided to write an azure function that uses a HTTP request as a trigger.
These two posts helped me a lot;
How can I use NuGet packages in my Azure Functions?
Copy from Azure Blob to AWS S3 using C#
Please note my answer to the Nuget packages if you are using Azure functions 2.x.
Here is the code - you can modify the basis of this to your needs.
I return a JSON Serialized object because Azure Data Factory requires this as a response from a http request sent from a pipeline;
#r "Microsoft.WindowsAzure.Storage"
#r "Newtonsoft.Json"
#r "System.Net.Http"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
using Microsoft.WindowsAzure.Storage.Blob;
using System.Net.Http;
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.S3.Transfer;
using Amazon.S3.Util;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
log.LogInformation("Example Function has recieved a HTTP Request");
// get Params from query string
string blobUri = req.Query["blobUri"];
string bucketName = req.Query["bucketName"];
// Validate query string
if (String.IsNullOrEmpty(blobUri) || String.IsNullOrEmpty(bucketName)) {
Result outcome = new Result("Invalid Parameters Passed to Function",false,"blobUri or bucketName is null or empty");
return new BadRequestObjectResult(outcome.ConvertResultToJson());
}
// cast the blob to its type
Uri blobAbsoluteUri = new Uri(blobUri);
CloudBlockBlob blob = new CloudBlockBlob(blobAbsoluteUri);
// Do the Copy
bool resultBool = await CopyBlob(blob, bucketName, log);
if (resultBool) {
Result outcome = new Result("Copy Completed",true,"Blob: " + blobUri + " Copied to Bucket: " + bucketName);
return (ActionResult)new OkObjectResult(outcome.ConvertResultToJson());
}
else {
Result outcome = new Result("ERROR",false,"Copy was not successful Please review Application Logs");
return new BadRequestObjectResult(outcome.ConvertResultToJson());
}
}
static async Task<bool> CopyBlob(CloudBlockBlob blob, string existingBucket, ILogger log) {
var accessKey = "myAwsKey";
var secretKey = "myAwsSecret";
var keyName = blob.Name;
// Make the client
AmazonS3Client myClient = new AmazonS3Client(accessKey, secretKey, Amazon.RegionEndpoint.EUWest1);
// Check the Target Bucket Exists;
bool bucketExists = await AmazonS3Util.DoesS3BucketExistAsync (myClient,existingBucket);
if (!bucketExists) {
log.LogInformation("Bucket: " + existingBucket + " does not exist or is inaccessible to the application");
return false;
}
// Set up the Transfer Utility
TransferUtility fileTransferUtility = new TransferUtility(myClient);
// Stream the file
try {
log.LogInformation("Starting Copy");
using (var stream = await blob.OpenReadAsync()) {
// Note: You need permissions to not be private on the source blob
log.LogInformation("Streaming");
await fileTransferUtility.UploadAsync(stream,existingBucket,keyName);
log.LogInformation("Streaming Done");
}
log.LogInformation("Copy completed");
}
catch (AmazonS3Exception e) {
log.LogInformation("Error encountered on server. Message:'{0}' when writing an object", e.Message);
}
catch (Exception e) {
log.LogInformation("Unknown encountered on server. Message:'{0}' when writing an object", e.Message);
return false;
}
return true;
}
public class Result {
public string result;
public bool outcome;
public string UTCtime;
public string details;
public Result(string msg, bool outcomeBool, string fullMsg){
result=msg;
UTCtime=DateTime.Now.ToString("yyyy-MM-dd h:mm:ss tt");
outcome=outcomeBool;
details=fullMsg;
}
public string ConvertResultToJson() {
return JsonConvert.SerializeObject(this);
}
}
You can use Skyplane to copy data across clouds (110X speedup over CLI tools, with automatic compression to save on egress). To transfer from Azure blob storage to S3 you can call one of the commands:
skyplane cp -r az://azure-bucket-name/ s3://aws-bucket-name/
skyplane sync -r az://azure-bucket-name/ s3://aws-bucket-name/
ADF now includes SFTP as a sink. From the same link provided in the question (supported as a sink is the far-right column):
Using the AWS Transfer family you can set up an SFTP server and add a user with an SSH public key, then use that configuration to set up an SFTP connection from ADF that will connect directly to an S3 bucket.
Download Files From Azure Storage using AzCopy into a temporal local repository
You can download the files from Azure Cloud storage to your local system, just follow the below command, use the recursive flag to copy all the files
azcopy /Source:[source_container_url] /Dest:[local_file_path] /Sourcekey:[source_storage_account_access_key] /s
Upload Local Files to Amazon S3 using aws s3 cp command
aws s3 cp local_file_path s3://my-bucket/ --recursive

An exception "The Content-MD5 you specified did not match what we received"

I got an exception, I never got before when testing my application that uploads a file from ec2 to s3. The content is:
Exception in thread "Thread-1" com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest; Request ID: 972CB8E04388AB20), S3 Extended Request ID: T7bmFnQ2RlGWlJD+aGYfTy97XZw88pbQrwNB8YCezSjyq6O2joxHRP/6ko+Q2zZeGewkw4x/90k=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1383)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3676)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1439)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What can I do to fix this bug? I used the same code as before in my application.
I think I have solved my problem. I finally found that some of my files actually changed during the uploading. Because the file is generated by another thread, the uploading and generating is done at the same time. The file can not be generated immediately, and during the generating of a file, it may be uploading at the same time, the file actually changed during the uploading.
The md5 of file is created at the beginning of uploading by the AmazonS3Client, then the whole file is uploaded to the S3, at this time, the file is different from the file uploaded at beginning, so the md5 actually changed. I modified my program to a single-threading program, and the problem never turned up again.
Another reason for having this issue is to run a code such as this (python)
with open(filename, 'r') as fd:
self._bucket1.put_object(Key=key, Body=fd)
self._bucket2.put_object(Key=key, Body=fd)
In this case the file object (fd) is pointing to the end of the file when it reaches line 3, so we will get the "Content MD5" error, in order to avoid it we will need to point the file reader back to the start position in the file
with open(filename, 'r') as fd:
bucket1.put_object(Key=key, Body=fd)
fd.seek(0)
bucket2.put_object(Key=key, Body=fd)
This way we won't get the aforementioned Boto error.
I also ran into this error when I was doing something like this:
InputStream productInputStream = convertImageFileToInputStream(file);
InputStream thumbnailInputStream = generateThumbnail(productInputStream);
String uploadedFileUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productFilename, productInputStream);
String uploadedThumbnailUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productThumbnailFilename, thumbnailInputStream);
The generateThumbnail method was manipulating the productInputStream using a third party library. Because I couldn't modify the third party library, I simply performed the upload first:
InputStream productInputStream = convertImageFileToInputStream(file);
// do this first...
String uploadedFileUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productFilename, productInputStream);
/// and then this...
InputStream thumbnailInputStream = generateThumbnail(productInputStream);
String uploadedThumbnailUrl = amazonS3Uploader.uploadToS3(BUCKET_PRODUCTS_IMAGES, productThumbnailFilename, thumbnailInputStream);
... and added this line inside my generateThumbnail method:
productInputStream.reset();
FWIW, I've managed to find a completely different way of triggering this problem, which requires a different solution.
It turns out that if you decide to assign ObjectMetadata to a PutObjectRequest explicitly, for example to specify a cacheControl setting, or a contentType, then the AWS SDK mutates the ObjectMetadata instance to stash the MD5 that it computes for the put request. This means that if you are putting multiple objects, all of which you think should have the same metadata assigned to them, you still need to create a new ObjectMetadata instance for each and every PutObjectRequest. If you don't do this, then it reuses the MD5 computed from the previous put request and you get the MD5 mismatch error on the second object you try to put.
So, to be explicit, doing something like this will fail on the second iteration:
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("text/html");
for(Put obj: thingsToPut)
{
PutObjectRequest por =
new PutObjectRequest(bucketName, obj.s3Key, obj.file);
por = por.withMetadata(metadata);
PutObjectResult res = s3.putObject(por);
}
You need to do it like this:
for(Put obj: thingsToPut)
{
ObjectMetadata metadata = new ObjectMetadata(); // <<-- New ObjectMetadata every time!
metadata.setContentType("text/html");
PutObjectRequest por =
new PutObjectRequest(bucketName, obj.s3Key, obj.file);
por = por.withMetadata(metadata);
PutObjectResult res = s3.putObject(por);
}
I too ran into this problem. How I solved this:
I have a microservice that processes AWS SQS Messages. Each message would create multiple temporary files that would have to be uploaded to S3.
The issue was that the temporary files were named with fixed names without any salt added to them.
So between two messages, it was possible to rewrite the original file that was to be uploaded.
I fixed it by adding a random salt (this can be a UUID or the current time in millis depending on what you want) to the file names, after which the files were not being over-written and were successfully uploaded to S3.
For me it was that I used ContentLength in the params while executing upload. When it is commented out, it worked just fine.
const params = {
Bucket: "",
ContentType: "application/json",
Key: "filename.json",
// ContentLength: body.length, <--- what I have commented out
Body: body
};
await s3.upload(params).promise();

TransferManager vs PutRequestObject in AmazonS3

I am confused about Object single upload and multipart upload. Both are taking consumes same time. My code is below:
File file = new File("D:\\AmazonS3\\aws-java-sdk-1.8.3\\lib\\aws-java-sdk-1.8.3-javadoc.jar");
FileInputStream fis = new FileInputStream(file);
String keyName = System.currentTimeMillis()+"_aws-java-sdk-1.8.3-javadoc.jar";
ObjectMetadata metaData = new ObjectMetadata();
metaData.addUserMetadata("test","TEST");
//Object single upload
PutObjectRequest putobejcObjectRequest = new PutObjectRequest(BUCKET_NAME, keyName, fis,metaData);
putobejcObjectRequest.setMetadata(metaData);
s3client.putObject(putobejcObjectRequest);
//Object multipart upload
TransferManagerConfiguration configuration = new TransferManagerConfiguration();
configuration.setMultipartUploadThreshold(5*com.amazonaws.services.s3.internal.Constants.MB);
TransferManager transferManager = new TransferManager(s3client);
transferManager.setConfiguration(configuration);
Upload upload = transferManager.upload(BUCKET_NAME, keyName, fis,metaData);
upload.waitForCompletion();
transferManager.shutdownNow();
Please help me, Is there any problem in my code.
I've faced the same problem and found (by inspecting SDK code) that transfer manager will use parallel parts upload only if you pass it file, not the InputStream.
Look at decision code from sdk (version 1.8.9):
if (TransferManagerUtils.isUploadParallelizable(putObjectRequest, isUsingEncryption)) {
captureUploadStateIfPossible();
uploadPartsInParallel(requestFactory, multipartUploadId);
return null;
} else {
return uploadPartsInSeries(requestFactory);
}
where is uploadParallelizable:
// Each uploaded part in an encrypted upload depends on the encryption context
// from the previous upload, so we cannot parallelize encrypted upload parts.
if (isUsingEncryption) return false;
// Otherwise, if there's a file, we can process the uploads concurrently.
return (getRequestFile(putObjectRequest) != null);
So, if you want to get advantage of parallel upload in parts, pass a file to TransferManager.
When uploading options from a stream, callers must supply the size of
options in the stream through the content length field in the
ObjectMetadata parameter. If no content length is specified for the
input stream, then TransferManager will attempt to buffer all the
stream contents in memory and upload the options as a traditional,
single part upload. Because the entire stream contents must be
buffered in memory, this can be very expensive, and should be avoided
whenever possible.
source: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#upload-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-

How to read an external file in Amazon Elastic MapReduce

Hi I'm new to using Amazon EMR and Hadoop. I was wondering how to read an external file (stored in S3) from an EMR job. For example, I have a file containing a long list of blacklisted strings. When my EMR job is processing my input, how do I get the job to read in this list of blacklisted strings beforehand in order to use it during processing?
I tried using a regular Java Scanner class and hardcoding the S3 path to the file but that didn't seem to work, although I could just be doing it wrong...
I'd do something like this (sorry code is scala not java, but it's the same)
Pass the path in as a argument to your main method
Set that as a property in your configuration
val conf = new Configuration()
conf.set("blacklist.file", args(0))
In the mapper's setup method, read the file:
var blacklist: List[String] = List()
override def setup(context: Context) {
val path = new Path(context.getConfiguration.get("blacklist.file"))
val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
blacklist = scala.io.Source.fromInputStream(fileSystem.open(path)).getLines.toList
}
It would be better if you may add this file to the distributed cache as follows :
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;
#Override
public void configure(JobConf job) {
s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
FileInputStream fstream = new FileInputStream(s3FilePath.toString());
// Read the file and build a HashMap/List or something which can be accessed from map/reduce methods as desired.
...
}