Apache Camel AWS S3 Bucket Nested Directory

I am currently trying to use Apache Camel with an S3 bucket on AWS. We currently have one bucket per environment for our code bases (for example: a dev bucket, an e2e bucket, and a prod bucket).
We would like to use a single bucket with multiple directories inside it to avoid the messiness and consolidate the space. We could then point Camel at a folder within an endpoint to listen to (which seems feasible). However, this has proven extremely problematic to configure: Camel seems to expect one bucket for each endpoint it connects to.
Can someone shed light on whether this is possible, or whether it is necessary to have one bucket per endpoint? The Camel documentation does not say.

Indeed, as Alex Nano wrote, S3 folders are just prefixes on the object key (the "filename"), so if you want to set up multiple routes monitoring different S3 "folders" you can use the prefix option of the aws-s3 endpoint.
Prefixes work the same way as in the AWS S3 CLI,
e.g.
aws s3 ls s3://mybucket/myfolderInsideBucket/
A JUnit example
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import org.apache.camel.EndpointInject;
import org.apache.camel.RoutesBuilder;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.mock.MockEndpoint;
import org.apache.camel.impl.JndiRegistry;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.junit.Before;
import org.junit.Test;

public class CamelAwsS3Test extends CamelTestSupport {

    @EndpointInject(uri = "mock:result")
    private MockEndpoint resultEndpoint;

    @Before
    public void setup() throws Exception {
        context.start();
    }

    @Test
    public void foo() throws Exception {
        int expectedFileCount = 3;
        resultEndpoint.expectedMessageCount(expectedFileCount);
        resultEndpoint.assertIsSatisfied();
    }

    @Override
    protected JndiRegistry createRegistry() throws Exception {
        JndiRegistry registry = super.createRegistry();
        final String accessKey = "ACCESSKEY";
        final String secretKey = "SECRETKEY";
        AmazonS3 client = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
        registry.bind("amazonS3Client", client);
        return registry;
    }

    @Override
    protected RoutesBuilder createRouteBuilder() throws Exception {
        return new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                // {BUCKET} and ${FOLDER} are placeholders for your bucket name and key prefix
                from("aws-s3://{BUCKET}?amazonS3Client=#amazonS3Client&prefix=${FOLDER}")
                        .log("hey ${body}").to("mock:result");
            }
        };
    }
}
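Note that with the prefix option set (e.g. prefix=myfolderInsideBucket/), the consumer only polls objects whose keys start with that prefix, so each route effectively watches one "folder" of the shared bucket.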

A few thoughts:
S3 only looks like it has folders; in reality, each object is a single name:value pair (bucket:key).
That is, only the top-level "buckets" are actual containers; what looks like a path is just part of the key.
So: just use one bucket, and then filter on the key with a predicate (http://camel.apache.org/predicate.html) in a choice/when ("case") statement to process what you need (a rough sketch follows these notes).
I'm sure there are other ways, but this seems straightforward.
Also note the AWS S3 limits that justify your impulse to consolidate:
by default, only 100 buckets per AWS account
bucket names must be globally unique (even across AWS accounts)
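For illustration, here is a rough sketch of that single-bucket approach in Camel's Java DSL (inside a RouteBuilder's configure()); it assumes the standard CamelAwsS3Key header, and the bucket name, key prefixes and target endpoints are hypothetical:
// Route everything from one bucket and branch on the object key prefix
from("aws-s3://my-bucket?amazonS3Client=#amazonS3Client")
    .choice()
        .when(header("CamelAwsS3Key").startsWith("dev/"))
            .to("direct:processDev")
        .when(header("CamelAwsS3Key").startsWith("e2e/"))
            .to("direct:processE2e")
        .otherwise()
            .to("direct:processOther");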

Related

What are the approaches to process a large S3 file from Lambda?

In my project, there is a need to process a file containing 50K records. But, as per the current AWS limits, the response payload size returned by a Lambda function cannot exceed 6 MB.
Can anyone help me understand which approaches/implementations I can use here?
Lambda code:
@Override
public String handleRequest(S3Event event, Context context) {
    String bucket = null;
    String key = null;
    String jsonStringFromCSV = null;
    csvToJsonParser = new FileToJsonParserServiceImpl();
    context.getLogger().log("Received event: " + event);
    try {
        // check the record list is non-empty before accessing its first element
        if (event.getRecords().size() > 0 && null != event.getRecords().get(0)) {
            bucket = event.getRecords().get(0).getS3().getBucket().getName();
            key = event.getRecords().get(0).getS3().getObject().getKey();
            S3Object response = s3.getObject(new GetObjectRequest(bucket, key));
            String contentType = response.getObjectMetadata().getContentType();
            context.getLogger().log("CONTENT TYPE: " + contentType);
            InputStream objectData = response.getObjectContent();
            BufferedReader bufferReader = new BufferedReader(new InputStreamReader(objectData, "UTF-8"));
            jsonStringFromCSV = csvToJsonParser.getMemberObjectListAsJson(bufferReader);
        }
        return jsonStringFromCSV;
    } catch (Exception e) {
        e.printStackTrace();
        context.getLogger().log(String.format("Error getting object %s from bucket %s. Make sure they exist and"
                + " your bucket is in the same region as this function.", key, bucket));
        throw new RuntimeException(e);
    }
}
The response size is not a problem here since you're already using S3; after all, your function is going to be executed asynchronously and no other system will be able to grab its response directly (you can, however, send its response to SNS or SQS and process it later, for example).
What you can do to achieve what you want is to have your Lambda triggered by an S3 event. You would then process the file (the invocation cannot exceed 15 minutes, though, so if your file is really huge Lambda may not be the best tool here, but 50K records, assuming each record is only a few kilobytes, should be fine) and output the result somewhere else (be it DynamoDB, S3, RDS, SNS, SQS, etc.).
Lambda's payload size limit is definitely not a constraint for you here.
EDIT: the OP wants to store the String generated from the CSV file somewhere, so here's one way to achieve it:
You could store that big JSON string in a file in S3 without having to invoke another Lambda to do the job. Long story short, your function would process the CSV file, generate a string out of it and then save it to an S3 object. If you need to store it in a database, you could use DynamoDB, but since DDB has limits on item size, you could simply create an attribute in your table called processedFilePath which points to your S3 object.
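As a minimal sketch (using the same AWS SDK for Java v1 client as the handler above; the bucket name and "_processed" suffix are placeholders), the function could finish by writing the generated string back to S3 instead of returning it:
// Sketch: persist the generated JSON back to S3; a DynamoDB item would then only
// need a processedFilePath attribute holding this key rather than the payload itself.
String processedKey = key + "_processed.json";
s3.putObject("my-processed-files-bucket", processedKey, jsonStringFromCSV);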
If you want to decouple things a little, your source Lambda could store a file in S3 called {some_identifier}_processed.txt and then put a message in SQS with this key (or use another S3 event, but let's use SQS to see a different approach). You could then wire up another Lambda to process this SQS message and store the reference in DynamoDB. Your item in DynamoDB would look like:
{
    "id": "2511d862-53c1-48e2-beb4-ecc6c9deaac1",
    "bucket": "my-processed-files-bucket",
    "processedFileKey": "{some_identifier}_processed.txt"
}
This way you can circumvent the payload size limits on both Lambda and DynamoDB and simply load the file via processedFileKey when necessary.
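If you go the SQS route, the consuming side might look roughly like this (a sketch only; the handler class, table and bucket names are hypothetical, and it assumes the aws-lambda-java-events SQSEvent type with the processed file key sent as the message body):
// Second Lambda: consume the SQS message carrying the processed file key
// and store the pointer item in DynamoDB. Names are placeholders.
public class StorePointerHandler implements RequestHandler<SQSEvent, Void> {
    private final AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            Map<String, AttributeValue> item = new HashMap<>();
            item.put("id", new AttributeValue(UUID.randomUUID().toString()));
            item.put("bucket", new AttributeValue("my-processed-files-bucket"));
            item.put("processedFileKey", new AttributeValue(msg.getBody()));
            ddb.putItem("ProcessedFiles", item);
        }
        return null;
    }
}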

Copy Data From Azure Blob Storage to AWS S3

I am new to Azure Data Factory and have an interesting requirement.
I need to move files from Azure Blob storage to Amazon S3, ideally using Azure Data Factory.
However, S3 isn't supported as a sink:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
I also understand from a variety of comments I've read on here that you cannot copy directly from Blob Storage to S3 - you would need to download the file locally and then upload it to S3.
Does anyone know of any examples in Data Factory, SSIS or an Azure Runbook that can do such a thing? I suppose an option would be to write an Azure Logic App or Function that is called from Data Factory.
Managed to get something working on this - it might be useful for someone else.
I decided to write an Azure Function that uses an HTTP request as a trigger.
These two posts helped me a lot;
How can I use NuGet packages in my Azure Functions?
Copy from Azure Blob to AWS S3 using C#
Please note my answer on the NuGet packages question if you are using Azure Functions 2.x.
Here is the code - you can modify it as a basis for your needs.
I return a JSON-serialized object because Azure Data Factory requires this as the response to an HTTP request sent from a pipeline:
#r "Microsoft.WindowsAzure.Storage"
#r "Newtonsoft.Json"
#r "System.Net.Http"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
using Microsoft.WindowsAzure.Storage.Blob;
using System.Net.Http;
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.S3.Transfer;
using Amazon.S3.Util;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
    log.LogInformation("Example Function has received a HTTP Request");

    // Get params from the query string
    string blobUri = req.Query["blobUri"];
    string bucketName = req.Query["bucketName"];

    // Validate the query string
    if (String.IsNullOrEmpty(blobUri) || String.IsNullOrEmpty(bucketName)) {
        Result outcome = new Result("Invalid Parameters Passed to Function", false, "blobUri or bucketName is null or empty");
        return new BadRequestObjectResult(outcome.ConvertResultToJson());
    }

    // Cast the blob to its type
    Uri blobAbsoluteUri = new Uri(blobUri);
    CloudBlockBlob blob = new CloudBlockBlob(blobAbsoluteUri);

    // Do the copy
    bool resultBool = await CopyBlob(blob, bucketName, log);

    if (resultBool) {
        Result outcome = new Result("Copy Completed", true, "Blob: " + blobUri + " Copied to Bucket: " + bucketName);
        return (ActionResult)new OkObjectResult(outcome.ConvertResultToJson());
    }
    else {
        Result outcome = new Result("ERROR", false, "Copy was not successful. Please review the application logs.");
        return new BadRequestObjectResult(outcome.ConvertResultToJson());
    }
}
static async Task<bool> CopyBlob(CloudBlockBlob blob, string existingBucket, ILogger log) {
    var accessKey = "myAwsKey";
    var secretKey = "myAwsSecret";
    var keyName = blob.Name;

    // Make the client
    AmazonS3Client myClient = new AmazonS3Client(accessKey, secretKey, Amazon.RegionEndpoint.EUWest1);

    // Check the target bucket exists
    bool bucketExists = await AmazonS3Util.DoesS3BucketExistAsync(myClient, existingBucket);
    if (!bucketExists) {
        log.LogInformation("Bucket: " + existingBucket + " does not exist or is inaccessible to the application");
        return false;
    }

    // Set up the Transfer Utility
    TransferUtility fileTransferUtility = new TransferUtility(myClient);

    // Stream the file
    try {
        log.LogInformation("Starting Copy");
        using (var stream = await blob.OpenReadAsync()) {
            // Note: the function needs read access to the source blob
            log.LogInformation("Streaming");
            await fileTransferUtility.UploadAsync(stream, existingBucket, keyName);
            log.LogInformation("Streaming Done");
        }
        log.LogInformation("Copy completed");
    }
    catch (AmazonS3Exception e) {
        log.LogInformation("Error encountered on server. Message:'{0}' when writing an object", e.Message);
        return false;
    }
    catch (Exception e) {
        log.LogInformation("Unknown encountered on server. Message:'{0}' when writing an object", e.Message);
        return false;
    }
    return true;
}
public class Result {
    public string result;
    public bool outcome;
    public string UTCtime;
    public string details;

    public Result(string msg, bool outcomeBool, string fullMsg){
        result = msg;
        // use UTC to match the field name
        UTCtime = DateTime.UtcNow.ToString("yyyy-MM-dd h:mm:ss tt");
        outcome = outcomeBool;
        details = fullMsg;
    }

    public string ConvertResultToJson() {
        return JsonConvert.SerializeObject(this);
    }
}
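To call the function, for example from an ADF Web activity, pass the two query-string parameters it reads; the function app name, function name and key below are placeholders, and the blob URI would normally need to be URL-encoded when embedded in a query string:
https://myfunctionapp.azurewebsites.net/api/CopyBlobToS3?code={function-key}&blobUri=https://myaccount.blob.core.windows.net/mycontainer/myfile.csv&bucketName=my-target-bucket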
You can use Skyplane to copy data across clouds (110X speedup over CLI tools, with automatic compression to save on egress). To transfer from Azure Blob Storage to S3 you can run one of these commands:
skyplane cp -r az://azure-bucket-name/ s3://aws-bucket-name/
skyplane sync -r az://azure-bucket-name/ s3://aws-bucket-name/
ADF now includes SFTP as a sink. From the same link provided in the question (sink support is shown in the far-right column):
Using the AWS Transfer family you can set up an SFTP server and add a user with an SSH public key, then use that configuration to set up an SFTP connection from ADF that will connect directly to an S3 bucket.
Download files from Azure Storage using AzCopy into a temporary local directory
You can download the files from Azure cloud storage to your local system with the command below; use the recursive flag (/S) to copy all the files:
azcopy /Source:[source_container_url] /Dest:[local_file_path] /SourceKey:[source_storage_account_access_key] /S
Upload the local files to Amazon S3 using the aws s3 cp command:
aws s3 cp local_file_path s3://my-bucket/ --recursive

Cascading Tap Configuration for Amazon S3 to Local file System

So I'm trying out the Cascading framework and I was able to run apps locally. As a next step I want to use Cascading to download files from S3 to the local file system. I'm planning to use FileTap. When I google it I find S3FS, which looks deprecated. Can someone point me in the right direction? In other words, what is the right way to create a FileTap for Amazon S3 with an access id and secret key, or something along those lines?
Though this code is for the HadoopFlowConnector, it will also work with the LocalFlowConnector if you use a FileTap.
public class Main {

    public void run(String[] args) {
        Properties properties = new Properties();
        String accessKey = args[0];
        String secretKey = args[1];

        // better to put these keys in a hadoop xml file
        // for the block file system
        properties.setProperty("fs.s3.awsAccessKeyId", accessKey);
        properties.setProperty("fs.s3.awsSecretAccessKey", secretKey);

        // for the s3 native file system
        // properties.setProperty("fs.s3n.awsAccessKeyId", accessKey);
        // properties.setProperty("fs.s3n.awsSecretAccessKey", secretKey);
        // properties.setProperty("fs.defaultFS", "hdfs://localhost:8020/");
        // properties.setProperty("fs.permissions.umask-mode", "007");

        AppProps.setApplicationJarClass(properties, Main.class);
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

        String input = "s3://my-bucket/my-log.csv";
        // If using the native S3 file system
        // String input = "s3n://my-bucket/my-log.csv";

        Tap inTap = new Hfs(new TextDelimited(false, ";"), input);
        Pipe copyPipe = new Pipe("copy");
        Tap outTap = new Hfs(new TextDelimited(false, ";"), "data/output");

        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap);

        flowConnector.connect(flowDef).complete();
    }

    public static void main(String[] args) {
        new Main().run(args);
    }
}
Code courtesy of http://whiteboxdeveloper.blogspot.com/2015/01/processing-data-from-aws-s3-using.html.
I just added the commented S3N input string line. Since I was using S3N-based access, I uncommented the S3N properties and used them.

How to read an external file in Amazon Elastic MapReduce

Hi, I'm new to using Amazon EMR and Hadoop. I was wondering how to read an external file (stored in S3) from an EMR job. For example, I have a file containing a long list of blacklisted strings. When my EMR job is processing my input, how do I get the job to read in this list of blacklisted strings beforehand so it can be used during processing?
I tried using a regular Java Scanner class and hardcoding the S3 path to the file, but that didn't seem to work, although I could just be doing it wrong...
I'd do something like this (sorry, the code is Scala not Java, but it's the same idea):
Pass the path in as an argument to your main method.
Set that as a property in your configuration:
val conf = new Configuration()
conf.set("blacklist.file", args(0))
In the mapper's setup method, read the file:
var blacklist: List[String] = List()
override def setup(context: Context) {
val path = new Path(context.getConfiguration.get("blacklist.file"))
val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
blacklist = scala.io.Source.fromInputStream(fileSystem.open(path)).getLines.toList
}
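Since the question asked for Java, a rough Java equivalent of the same setup method might look like this (the enclosing Mapper class and its type parameters are omitted for brevity):
private List<String> blacklist = new ArrayList<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // read the blacklist file from the path stored in the job configuration
    Path path = new Path(context.getConfiguration().get("blacklist.file"));
    FileSystem fs = FileSystem.get(path.toUri(), context.getConfiguration());
    try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            blacklist.add(line);
        }
    }
}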
It would be better if you add this file to the distributed cache, as follows:
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    // Read the file and build a HashMap/List or something which can be accessed from the map/reduce methods as desired.
    ...
}
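As an aside, DistributedCache is deprecated in the newer org.apache.hadoop.mapreduce API; a rough sketch of the equivalent (the job and variable names are illustrative):
// Driver side: register the S3 file on the Job instead of on the JobConf
Job job = Job.getInstance(conf, "my-job");
job.addCacheFile(new URI(s3FilePath));

// Task side: read it back in setup() instead of configure()
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    // cacheFiles[0] refers to the registered file; open it and build the
    // HashMap/List lookup structure just as in the configure() example above
}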

Hadoop Mapreduce Configuration (New API)

I just wanted to try a tutorial (https://ccp.cloudera.com/display/DOC/Hadoop+Tutorial) program, WordCount v.2 (bottom of the page), in which they use the following method to set up some basic variables for the program:
public void configure(JobConf job) {
...
}
However, I'm trying to use the new Hadoop API and this method does not seem to exist anymore. Can anyone tell me what the equivalent way of doing something like this is in the new API?
Also, how can I access my Configuration at runtime? Do I simply call:
Job.getConfiguration();
You can override the setup method in your Mapper/Reducer; it behaves like configure.
The signature looks as follows:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
There you get a Context object, on which you can call:
Configuration conf = context.getConfiguration();
map and cleanup both receive these Context objects, so you can get at your Configuration at any time.
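For completeness, here is a small sketch of how this fits together on the driver side with the new org.apache.hadoop.mapreduce API; the property and class names are illustrative:
// Driver: create the Job from a Configuration, set your own properties on it,
// then read them back inside setup()/map()/cleanup() via context.getConfiguration().
Configuration conf = new Configuration();
conf.set("wordcount.case.sensitive", "false");   // illustrative property

Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);              // WordCount is an illustrative driver class
job.setMapperClass(TokenizerMapper.class);       // a Mapper that overrides setup()

// Inside the Mapper/Reducer:
// Configuration taskConf = context.getConfiguration();
// boolean caseSensitive = taskConf.getBoolean("wordcount.case.sensitive", true);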