How to create directories in AWS S3 using Apache NiFi putS3Object - amazon-s3

I have a working config to push files from a directory on my server to an S3 bucket. NiFi runs on a different server, so I use GetSFTP. The source files sit in subfolders that my current PutS3Object config does not preserve, so it jams all of the files at the root level of the S3 bucket. I know there's a way to get PutS3Object to create directories using defined folders. The Object Key is set to ${filename} by default. If I set it to, say, my/directory/${filename}, it creates two folders, my and the subfolder directory, and puts the files inside. However, I do NOT know what to set as the Object Key to replicate the files' source directories.

Try ${path}/${filename}, based on this passage in the documentation:
Keeping with the example of a file that is picked up from a local file system, the FlowFile would have an attribute called filename that reflected the name of the file on the file system. Additionally, the FlowFile will have a path attribute that reflects the directory on the file system that this file lived in.
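As a rough illustration (the remote directory and file names below are made up, and whether path comes through as relative or absolute depends on how the source processor is configured), the attribute values and resulting object key would look something like this:

    File on the SFTP server:  /data/incoming/reports/2023/report.csv
    filename attribute:       report.csv
    path attribute:           reports/2023
    Object Key expression:    ${path}/${filename}
    Resulting S3 key:         reports/2023/report.csv

If path arrives as an absolute path (e.g. /data/incoming/reports/2023), an UpdateAttribute processor before PutS3Object can strip the leading portion so the resulting key stays relative.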

Related

Apache Camel eats S3 "folders" created programmatically, but not ones created in the AWS S3 Console

We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder the file is in), submit the data in the file (and metadata) to another system and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are consumed, so the directory structure disappears.
I know S3 technically does not have folders, just empty objects whose keys end in /.
The twist here is that any "folder" created in the S3 Console is NOT consumed; they stay there as we want them to. Any folders created via the AWS CLI or boto3 are immediately consumed.
The problem is that we do need the folders to be created through automation; there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders and how the CLI does it. Support confirmed that the command I used in the CLI is correct.
I think my issue is similar to Apache Camel deleting AWS S3 bucket's folder, but that has no answer...
How can I get Camel to not "eat" any folders?
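A "folder" placeholder, however it is created, is just a zero-byte object whose key ends in /. A minimal boto3 sketch of creating one programmatically (the bucket and prefix names here are hypothetical) looks like this:

    import boto3

    s3 = boto3.client("s3")

    # An S3 "folder" is just a zero-byte object whose key ends with "/".
    # The CLI equivalent would be:
    #   aws s3api put-object --bucket my-incoming-bucket --key some/prefix/
    s3.put_object(Bucket="my-incoming-bucket", Key="some/prefix/", Body=b"")

If Camel keeps consuming these markers, one workaround is to filter out keys ending in / inside the route (or re-create the marker after processing) rather than relying on the component to skip them.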

AWS S3 event ObjectRemoved - get file

I am trying to access a file that has been deleted from an S3 bucket using an AWS Lambda function.
I have set up a trigger for s3:ObjectRemoved:*; however, after extracting the bucket and file name of the deleted file, the file is already gone from S3, so I do not have access to its contents.
What approach should be taken with AWS Lambda to get the contents of the file after it is deleted from an S3 bucket?
The comment proposed by @keithRozario was useful; however, with versioning, a GET request results in a not-found error, as per the S3 documentation.
@Ersoy suggested creating a 'bin' bucket or directory/prefix holding a copy under the same file name and working with that as per your requirements.
In my case, that meant copying each newly created object to a bin directory and then reading from that directory when the file is deleted from the main upload directory.
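A minimal sketch of that approach, assuming a hypothetical bin/ prefix in the same bucket and a single Lambda subscribed to both ObjectCreated and ObjectRemoved events (the names and prefix are illustrative, not from the original setup):

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")
    BIN_PREFIX = "bin/"  # hypothetical backup prefix

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Ignore events for the backup copies themselves to avoid loops.
            if key.startswith(BIN_PREFIX):
                continue

            if record["eventName"].startswith("ObjectCreated"):
                # Keep a copy under bin/ so it survives a later delete.
                s3.copy_object(
                    Bucket=bucket,
                    CopySource={"Bucket": bucket, "Key": key},
                    Key=BIN_PREFIX + key,
                )
            elif record["eventName"].startswith("ObjectRemoved"):
                # The original is gone; read the surviving bin/ copy instead.
                body = s3.get_object(Bucket=bucket, Key=BIN_PREFIX + key)["Body"].read()
                # ... process the deleted file's contents here ...

A lifecycle rule on the bin/ prefix can keep the backup copies from accumulating indefinitely.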

Redshift Spectrum: how to import only certain files

When using Redshift Spectrum, it seems you can only import data by providing a location down to a folder, and it then imports all the files inside that folder.
Is there a way to import only one file from inside a folder with many files? When I provide the full path with the filename, it seems to treat the file as a manifest file and gives errors: manifest is too large, or JSON not supported.
Is there any other way?
You inadvertently answered your own question: Use a manifest file
From CREATE EXTERNAL TABLE - Amazon Redshift:
LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' }
The path to the Amazon S3 bucket or folder that contains the data files or a manifest file that contains a list of Amazon S3 object paths. The buckets must be in the same AWS Region as the Amazon Redshift cluster.
If the path specifies a manifest file, the s3://bucket/manifest_file argument must explicitly reference a single file; for example, 's3://mybucket/manifest.txt'. It can't reference a key prefix.
The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3 and the size of the file, in bytes. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same AWS Region as the Amazon Redshift cluster.
I'm not sure why it requires the length of each file. It might be used to distribute the workload amongst multiple nodes.
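If the manifest is easier to generate than to write by hand, a rough boto3 sketch along the lines of the documented format (the bucket, object key, and manifest name below are made up) could look like this:

    import json
    import boto3

    s3 = boto3.client("s3")
    bucket = "mybucket"                   # hypothetical bucket
    data_key = "folder/data_0001.csv"     # the single file to expose to Spectrum

    # The manifest needs each file's size in bytes; HEAD the object to get it.
    size = s3.head_object(Bucket=bucket, Key=data_key)["ContentLength"]

    manifest = {
        "entries": [
            {
                "url": f"s3://{bucket}/{data_key}",
                "meta": {"content_length": size},
            }
        ]
    }

    # Upload the manifest, then point the external table at it:
    #   LOCATION 's3://mybucket/manifest.txt'
    s3.put_object(
        Bucket=bucket,
        Key="manifest.txt",
        Body=json.dumps(manifest).encode("utf-8"),
    )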

Is it possible to solely use AWS S3 for nextcloud/opencloud drive

I'm trying to configure Nextcloud to use S3 as the sole path of all files and therefore not hold any files locally.
I guess this can be done but only within a subdirectory. Would it be possible to do it at the root path? It seems the External Storage configuration requires a folder to be entered and / does not seem to be valid.

Nifi ListS3 processor not returning full path for files stored in S3

I am using the ListS3 processor to get files from S3 and piping it into the RouteOnAttribute processor. From there I am using Route to Property name as the Routing Strategy and assigning properties based on which files I am listening for.
I am able to see all the files I want but can't do anything with them because another processor down the line needs the full path of those files. I am using a Python script that takes in the file path as a command-line argument.
How do I extract the full absolute path of the files from S3?
You can list, download, and save S3 files locally using a sequence of NiFi processors like the following:
ListS3 - to get references to S3 objects, which you can filter. The output from ListS3 contains only references to the objects, not the content itself, exposed as attributes:
s3.bucket - name of the bucket, like my-bucket
filename - key of the object, like path/to/file.txt
FetchS3Object - to download object content from S3 using the bucket and key from ListS3 above.
PutFile - to store the file locally. Specify the Directory property with the location where you want the files placed, e.g. /path/to/directory. The filename attributes from S3 will contain the relative paths from the S3 keys, so these are appended to the Directory by default.
You can then assemble local paths for your Python script using NiFi expression language:
/path/to/directory/${filename}
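On the script side, if the path is passed in via something like an ExecuteStreamCommand processor with that expression in its Command Arguments, a minimal sketch of the receiving Python script (the argument handling here is hypothetical, not the original script) might be:

    import sys
    from pathlib import Path

    def main() -> None:
        # NiFi passes the assembled local path, e.g. /path/to/directory/<filename>,
        # as the first command-line argument.
        file_path = Path(sys.argv[1])
        if not file_path.is_file():
            sys.exit(f"File not found: {file_path}")

        data = file_path.read_bytes()
        # ... process the file contents here ...

    if __name__ == "__main__":
        main()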