NiFi ListS3 processor not returning full path for files stored in S3 - amazon-s3

I am using the ListS3 processor to get files from S3 and piping it into the RouteOnAttribute processor. From there I am using Route to Property name as the Routing Strategy and assigning properties based on which files I am listing.
I am able to see all the files I want but can't do anything with them, because another processor down the line needs the full path of those files. I am using a Python script that takes the file path as a command-line argument.
How do I extract the full absolute path of the files from S3?

You can list, download, and save S3 files locally using a sequence of NiFi processors like the following:
ListS3 - to get references to S3 objects, with optional filtering. The output from ListS3 contains only references to the objects, not the content itself, carried in flow file attributes:
s3.bucket - name of the bucket, like my-bucket
filename - key of the object, like path/to/file.txt
FetchS3Object - to download object content from S3 using the bucket and key from ListS3 above.
PutFile - to store the file locally. Specify the Directory property where you want the files to be placed, e.g. /path/to/directory. The filename attribute from ListS3 contains the relative path from the S3 key, so it is appended to the Directory by default.
You can then assemble local paths for your Python script using NiFi expression language:
/path/to/directory/${filename}
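If the downstream step is, for example, an ExecuteStreamCommand processor invoking the Python script, that assembled path can be passed as a command argument. A minimal sketch of the script side (the argument handling and processing logic here are assumptions, not the asker's actual script):

#!/usr/bin/env python3
# Minimal sketch of a script that receives the local file path assembled
# by NiFi (e.g. /path/to/directory/${filename}) as its first argument.
import os
import sys

def main() -> int:
    if len(sys.argv) < 2:
        print("usage: process_file.py <absolute-file-path>", file=sys.stderr)
        return 1
    path = sys.argv[1]
    if not os.path.isfile(path):
        print(f"file not found: {path}", file=sys.stderr)
        return 1
    with open(path, "rb") as fh:
        data = fh.read()
    # Placeholder for the real processing logic.
    print(f"processed {path} ({len(data)} bytes)")
    return 0

if __name__ == "__main__":
    sys.exit(main())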

Related

NiFi ListS3 Processor includes parent file path as a flow file

My files in S3:
s3://my_bucket/my_path/data/category/myfile.txt
Using the ListS3 processor with that bucket and "my_path/data/category/" as the prefix,
I will get TWO flow files:
"s3://my_bucket/my_path/data/category/myfile.txt"
and
"s3://my_bucket/my_path/data/category/"
The 2nd one here does not correspond to an actual file, only to the parent path itself.
How can I change my processor configuration to only get the entry for "myfile.txt"?
Also, FetchS3Object seems to be picking this entry up and sending it to the next processor, ExecuteScript, which modifies the contents of the file.
This ExecuteScript processor is obviously failing but not logging it; instead, the flow file just gets stuck in the queue.
How do I make it send this to the failure path instead of being stuck in the queue?
Found the solution! There is a 'Delimiter' property on the ListS3 processor where I needed to set '/' as the delimiter. Amazon S3 uses the delimiter to exclude the parent 'directory' entry from the listing.
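For reference, you can inspect outside NiFi what S3 actually returns for that prefix (including whether a zero-byte 'folder' object exists) and how a '/' delimiter changes the listing. A small boto3 sketch, using the bucket and prefix from the question:

# Lists the prefix from the question with and without a '/' delimiter and
# prints the returned keys and common prefixes. Requires AWS credentials.
import boto3

s3 = boto3.client("s3")
bucket = "my_bucket"
prefix = "my_path/data/category/"

for delimiter in (None, "/"):
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    if delimiter:
        kwargs["Delimiter"] = delimiter
    resp = s3.list_objects_v2(**kwargs)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    print(f"Delimiter={delimiter!r}: keys={keys}, common_prefixes={prefixes}")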

AWS s3 event ObjectRemoved - get file

I am trying to access a file that has been deleted from an S3 bucket, using AWS Lambda.
I have set up a trigger for s3:ObjectRemoved:*; however, by the time I extract the bucket and file name of the deleted file, the file is already gone from S3, so I do not have access to its contents.
What approach should be taken with AWS Lambda to get the contents of a file after it has been deleted from an S3 bucket?
The comment proposed by @keithRozario was useful; however, with versioning enabled, a plain GET request on the deleted object results in a not-found error, as per the S3 documentation.
I went with @Ersoy's suggestion of creating a 'bin' bucket, or a directory/prefix, that keeps a copy under the same file name and working with that as per your requirements.
In my case, I copy the initial object to a bin directory when it is created, and then access that copy when the file is deleted from the main upload directory.
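A minimal sketch of that approach, assuming two Lambda handlers: one triggered by s3:ObjectCreated:* on the upload prefix that copies each new object to a bin/ prefix, and one triggered by s3:ObjectRemoved:* that reads the copy. The bin/ prefix and handler names are illustrative, not the asker's actual setup:

# Keep a copy of every uploaded object under a bin/ prefix so its content
# is still readable after the original is deleted.
import urllib.parse
import boto3

s3 = boto3.client("s3")
BIN_PREFIX = "bin/"  # assumed prefix for the backup copies

def on_object_created(event, context):
    # Triggered by s3:ObjectCreated:* on the main upload prefix.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=bucket,
            Key=BIN_PREFIX + key,
            CopySource={"Bucket": bucket, "Key": key},
        )

def on_object_removed(event, context):
    # Triggered by s3:ObjectRemoved:* on the main upload prefix.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=BIN_PREFIX + key)["Body"].read()
        # ... process the deleted file's content here ...
        print(f"{key} was deleted; recovered {len(body)} bytes from {BIN_PREFIX}")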

How to create directories in AWS S3 using Apache NiFi putS3Object

I have a working config to push files from a directory on my server to an S3 bucket. NiFi is running on a different server, so I am using GetSFTP. The source files are in subfolders that my current PutS3Object config does not preserve, so it jams all of the files at the root level of the S3 bucket. I know there's a way to get PutS3Object to create directories using defined folders: the Object Key is set to ${filename} by default, and if I set it to, say, my/directory/${filename}, it creates two folders, my and the subfolder directory, and puts the files inside. However, I do NOT know what to set as the object key to replicate the files' source directories.
Try ${path}/${filename} based on this in the documentation:
Keeping with the example of a file that is picked up from a local file system, the FlowFile would have an attribute called filename that reflected the name of the file on the file system. Additionally, the FlowFile will have a path attribute that reflects the directory on the file system that this file lived in.
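For illustration, this is roughly what an Object Key of ${path}/${filename} produces: each file's relative directory becomes part of its S3 key. A boto3 sketch under assumed names (my-bucket, /data/incoming), not the actual NiFi flow:

# Rough equivalent of PutS3Object with Object Key = ${path}/${filename}.
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # assumed bucket name
base_dir = "/data/incoming"   # assumed local root, like the GetSFTP source

for root, _dirs, files in os.walk(base_dir):
    for name in files:
        local_path = os.path.join(root, name)
        rel_dir = os.path.relpath(root, base_dir)              # ~ NiFi's ${path}
        key = name if rel_dir == "." else f"{rel_dir}/{name}"  # ~ ${path}/${filename}
        s3.upload_file(local_path, bucket, key)
        print(f"{local_path} -> s3://{bucket}/{key}")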

Kafka connect s3 bucket folder

Can I create my own directory in s3 using confluent S3SinkConnector?
I know it creates a folder structure, but unfortunately we need a new directory structure.
You can change topics.dir, which is then followed by the path generated by the partitioner.class.
If you need "a new directory structure" (quoted because S3 has no real directories), then you would need to look at implementing your own Partitioner class:
https://docs.confluent.io/current/connect/kafka-connect-s3/index.html#s3-object-names
Additionally, if you want to completely remove the first S3 'folder' ('topics' by default), you can set the topics.dir configuration to the backspace character: \b.
This way, {bucket}/\b/{partitioner_defined_path} becomes {bucket}/{partitioner_defined_path}.
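For reference, topics.dir is just a connector configuration property. A hedged sketch of setting it when creating the connector through the Kafka Connect REST API (the connector name, topic, bucket, region, and endpoint below are assumptions):

# Create an S3 sink connector with a custom topics.dir via the Connect REST API.
import json
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Connect endpoint

connector = {
    "name": "s3-sink-custom-dir",  # assumed connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",           # assumed topic
        "s3.bucket.name": "my-bucket",  # assumed bucket
        "s3.region": "us-east-1",       # assumed region
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "flush.size": "1000",
        # Replaces the default 'topics' top-level prefix in the object keys.
        "topics.dir": "my/custom/prefix",
    },
}

resp = requests.post(CONNECT_URL, json=connector)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))

Note that topics.dir only changes the leading prefix; the rest of the key is still produced by partitioner.class.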

Redshift Spectrum: how to import only certain files

When using Redshift Spectrum, it seems you can only import data by providing a location down to a folder, and it then imports all the files inside that folder.
Is there a way to import only one file from a folder containing many files? When providing the full path with the filename, I think it treats the file as a manifest file and gives errors: manifest is too large, or JSON not supported.
Is there any other way?
You inadvertently answered your own question: Use a manifest file
From CREATE EXTERNAL TABLE - Amazon Redshift:
LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' }
The path to the Amazon S3 bucket or folder that contains the data files or a manifest file that contains a list of Amazon S3 object paths. The buckets must be in the same AWS Region as the Amazon Redshift cluster.
If the path specifies a manifest file, the s3://bucket/manifest_file argument must explicitly reference a single file; for example, 's3://mybucket/manifest.txt'. It can't reference a key prefix.
The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3 and the size of the file, in bytes. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same AWS Region as the Amazon Redshift cluster.
I'm not sure why it requires the length of each file. It might be used to distribute the workload amongst multiple nodes.
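A hedged sketch of building such a manifest with boto3, listing a single data file and filling in its size via head_object (the bucket and key names are examples, not from the question):

# Build a Redshift Spectrum manifest that lists exactly one data file.
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                     # assumed bucket
data_key = "folder/only_this_file.csv"   # the one file you want to import
manifest_key = "manifests/only_this_file.manifest"

size = s3.head_object(Bucket=bucket, Key=data_key)["ContentLength"]
manifest = {
    "entries": [
        {
            "url": f"s3://{bucket}/{data_key}",
            "mandatory": True,
            "meta": {"content_length": size},  # size in bytes, as the docs require
        }
    ]
}
s3.put_object(Bucket=bucket, Key=manifest_key, Body=json.dumps(manifest))
print(f"LOCATION 's3://{bucket}/{manifest_key}'")

The external table's LOCATION would then point at the manifest object instead of the folder, so only the listed file is read.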