How to read an S3 file from a REST endpoint using Apache Camel 2.x

I need to read a file from an S3 bucket based on the file name passed from a REST endpoint. The file name will be passed as a URI parameter. The Camel route I am using is given below.
rest().get("/file")
.to("direct:readFile");
from("direct:readFile")
.setProperty("FileName",simple("${header.fileName}"))
.log(LoggingLevel.INFO, "Reading file from S3")
.setHeader(S3Constants.KEY, simple("${header.fileName}"))
.to("aws-s3://"+awsBucket+"?amazonS3Client=#s3Client&fileName=${header.fileName}")
...
Instead of reading the file, this is currently overwriting the existing file in S3 with an empty file. How do I specify that this is a read operation and not a write operation?
I know that the latest version has an operation=getObject parameter for this, but I am using 2.18.1 because the rest of the application is also on that version, and 2.18.1 does not support this operation.
Is there any other way to achieve this?
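A minimal workaround sketch, not from the thread: the camel-aws-s3 producer in 2.18.1 only writes objects, so one option is to call the registered AmazonS3 client directly from a processor. This assumes the same #s3Client bean and awsBucket variable as in the route above, and uses com.amazonaws.services.s3.AmazonS3 and com.amazonaws.services.s3.model.S3Object:
from("direct:readFile")
    .setProperty("FileName", simple("${header.fileName}"))
    .log(LoggingLevel.INFO, "Reading file from S3")
    .process(exchange -> {
        // look up the same client bean that the aws-s3 endpoint would use
        AmazonS3 s3Client = exchange.getContext().getRegistry()
                .lookupByNameAndType("s3Client", AmazonS3.class);
        String key = exchange.getIn().getHeader("fileName", String.class);
        // fetch the object and put its content stream on the exchange body
        S3Object s3Object = s3Client.getObject(awsBucket, key);
        exchange.getIn().setBody(s3Object.getObjectContent());
    });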

Related

NiFi ListS3 Processor includes parent file path as a flow file

My files in S3:
s3://my_bucket/my_path/data/category/myfile.txt
Using the ListS3 processor with the bucket and passing "my_path/data/category/" as the prefix,
I will get TWO flow files:
"s3://my_bucket/my_path/data/category/myfile.txt"
and
"s3://my_bucket/my_path/data/category/"
The second one here is not an actual file but only the parent path.
How can I change my processor configuration to only get the entry for "myfile.txt"?
Also, FetchS3Object seems to be picking this up and sending it to the next processor, ExecuteScript, which modifies the contents of the file.
This ExecuteScript processor is obviously failing but not logging it; instead, the flow file is just stuck in the queue.
How do I make it send this to the failure path instead of leaving it stuck in the queue?
Found the solution! There is a 'Delimiter' property on the ListS3 processor where I needed to set '/' as the delimiter. Amazon S3 uses the delimiter to exclude the parent directory from the listing.

Upload multiple files to AWS S3 bucket without overwriting existing objects

I am very new to AWS technology.
I want to add some files to an existing S3 bucket without overwriting existing objects. I am using Spring Boot technology for my project.
Can anyone please suggest how we can add/upload multiple files without overwriting existing objects?
AWS S3 supports object versioning on the bucket: if you upload the same file again, S3 keeps each upload as a separate version rather than overwriting it.
Versioning can be enabled using the AWS Console or the CLI. You may want to refer to this link for more info.
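If it helps, the same setting can also be applied programmatically; a minimal sketch with the AWS SDK for Java v1, where the bucket name is a placeholder:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketVersioningConfiguration;
import com.amazonaws.services.s3.model.SetBucketVersioningConfigurationRequest;

public class EnableVersioning {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // keep every upload of the same key as a new version instead of overwriting it
        s3.setBucketVersioningConfiguration(new SetBucketVersioningConfigurationRequest(
                "my-bucket", // placeholder bucket name
                new BucketVersioningConfiguration(BucketVersioningConfiguration.ENABLED)));
    }
}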
You probably already found an answer to this, but if you're using the CDK or the CLI you can specify a destinationKeyPrefix. If you want multiple folders in an S3 bucket, which was my case, the folder name will be your destinationKeyPrefix.
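For completeness, a minimal sketch of the same idea with the AWS SDK for Java v1 (the bucket, prefix, and local directory are placeholders): each file gets its own key under a prefix, and existing keys are skipped so nothing is overwritten.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class UploadWithoutOverwrite {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-bucket";            // placeholder bucket name
        String prefix = "uploads/batch-01/";    // acts like a destinationKeyPrefix "folder"

        for (File file : new File("/local/upload-dir").listFiles()) {
            String key = prefix + file.getName();
            // skip keys that already exist so existing objects are never overwritten
            if (!s3.doesObjectExist(bucket, key)) {
                s3.putObject(bucket, key, file);
            }
        }
    }
}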

Flink Streaming AWS S3 read multiple files in parallel

I am new to Flink. My understanding is that the following API call
StreamExecutionEnvironment.getExecutionEnvironment().readFile(format, path)
will read the files in parallel for the given S3 bucket path.
We are storing log files in S3. The requirement is to serve multiple client requests that read from different folders with timestamps.
For my use case, to serve multiple client requests, I am evaluating Flink. So I want Flink to read from different AWS S3 file paths in parallel.
Is it possible to achieve this in a single Flink job? Any suggestions?
Documentation about the S3 file system support can be found here.
You can read from different directories and use the union() operator to combine all the records from different directories into one stream.
It is also possible to read nested files by using something like (untested):
TextInputFormat format = new TextInputFormat(path);
Configuration config = new Configuration();
// enable recursive enumeration so files in nested directories are read as well
config.setBoolean("recursive.file.enumeration", true);
format.configure(config);
env.readFile(format, path);
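On the union() point above, a minimal sketch (the bucket and paths are placeholders, and it assumes the S3 filesystem is configured for the cluster) that reads two S3 directories and merges them into one stream within a single job:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReadMultipleS3Paths {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // each source reads its own directory with the environment's parallelism
        DataStream<String> clientA = env.readTextFile("s3://my-bucket/logs/2021-01-01/");
        DataStream<String> clientB = env.readTextFile("s3://my-bucket/logs/2021-01-02/");

        // union() combines both sources into a single stream
        DataStream<String> combined = clientA.union(clientB);
        combined.print();

        env.execute("read-multiple-s3-paths");
    }
}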

Nifi ListS3 processor not returning full path for files stored in S3

I am using the ListS3 processor to get files from S3 and piping it into the RouteOnAttribute processor. From there I am using "Route to Property name" as the Routing Strategy and assigning properties based on which files I am listing.
I am able to see all the files I want but can't do anything with them, because another processor down the line needs the full path of those files. I am using a Python script that takes the file path as a command-line argument.
How do I extract the full absolute path of the files from S3?
You can list, download, and save S3 files locally using a sequence of NiFi processors like the following:
ListS3 - to get references to S3 objects, which you can filter. The output from ListS3 contains only references to the objects, not the content itself, carried in attributes:
s3.bucket - name of the bucket, like my-bucket
filename - key of the object, like path/to/file.txt
FetchS3Object - to download object content from S3 using the bucket and key from ListS3 above.
PutFile - to store the file locally. Specify the Directory property where you want the files to be placed, e.g. /path/to/directory. The filename attribute from S3 will contain the relative path from the S3 key, so it is appended to the Directory by default.
You can then assemble local paths for your Python script using NiFi expression language:
/path/to/directory/${filename}

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 object via HTTP.
A quick search for an example program brought up this link: https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then read it with the HTTP URL classes natively supported by Java.
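A minimal sketch of that approach (the bucket and key are placeholders, and the object is assumed to be publicly readable):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadS3OverHttp {
    public static void main(String[] args) throws Exception {
        // public objects can be fetched straight from the bucket's HTTPS endpoint
        URL url = new URL("https://my-bucket.s3.amazonaws.com/path/to/static-file.txt");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // process each line of the static file here
            }
        }
    }
}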
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.
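A minimal sketch of reading the locally copied file inside the mapper (the destination path and the mapper's input/output types are placeholders that would match your bootstrap script and job):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StaticDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private List<String> staticData;

    @Override
    protected void setup(Context context) throws IOException {
        // placeholder local path where the bootstrap script copied the S3 file
        staticData = Files.readAllLines(Paths.get("/home/hadoop/file.txt"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use staticData to enrich or filter each record, then emit
        context.write(new Text(key.toString()), value);
    }
}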