So we uploaded about 2000 files into an S3 bucket using boto3; the bucket does not have versioning enabled.
However, some of those files had the same keys as objects that were already in the bucket, so those objects were overwritten. Is there a way to check which of the files were already in the bucket before the upload? The lack of versioning is the issue here.
We did not take a listing of the bucket's contents before uploading.
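Without versioning there is nothing left in the bucket itself that records which keys were overwritten, so this can't be reconstructed after the fact from the bucket alone. For future runs, a minimal boto3 sketch along these lines (the bucket name and file list are hypothetical) would capture the "already present" keys before uploading:

    # Minimal sketch: record which keys already exist BEFORE uploading.
    # The bucket name and the list of local files are hypothetical.
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                      # hypothetical bucket name
    files_to_upload = ["a.csv", "b.csv"]      # hypothetical local files

    # Collect every key currently in the bucket.
    existing_keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            existing_keys.add(obj["Key"])

    # Keys that this upload run would overwrite.
    already_present = [f for f in files_to_upload
                       if os.path.basename(f) in existing_keys]
    print("Would overwrite:", already_present)

    for f in files_to_upload:
        s3.upload_file(f, bucket, os.path.basename(f))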
We have an Apache Camel app that is supposed to read files from a certain directory structure in S3, process each file (generating some metadata based on the folder the file is in), submit the file's data (and metadata) to another system, and finally move the consumed files into a different bucket, deleting the originals from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are being consumed, so the directory structure disappears.
I know S3 technically does not have folders, just zero-byte objects whose keys end in /.
The twist here is that any "folder" created in the S3 Console is NOT consumed; it stays there as we want it to. Any folder created via the AWS CLI or boto3 (see the sketch after this question for how we create them) is consumed immediately.
The problem is that we do need the folders to be created with automation; there are far too many to create by hand.
I've reached out to AWS Support, and they just tell me that there is no difference between how the Console creates folders and how the CLI does it. Support confirmed that the CLI command I used is correct.
I think my issue is similar to "Apache Camel deleting AWS S3 bucket's folder", but that has no answer...
How can I get Camel to not "eat" any folders?
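For reference, here is roughly how we create the "folders" with boto3; they end up as zero-byte objects whose keys end in a slash (the bucket and prefix names here are made up):

    # Minimal sketch: create "folder" placeholders as zero-byte objects
    # whose keys end in "/". Bucket and prefix names are hypothetical.
    import boto3

    s3 = boto3.client("s3")
    bucket = "incoming-bucket"                       # hypothetical bucket
    prefixes = ["region1/siteA/", "region1/siteB/"]  # hypothetical folders

    for prefix in prefixes:
        # An empty body plus the trailing "/" in the key is what makes
        # the S3 Console render the object as a folder.
        s3.put_object(Bucket=bucket, Key=prefix, Body=b"")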
I am trying to access a file that has been deleted from an S3 bucket using an AWS Lambda function.
I have set up a trigger for s3:ObjectRemoved*; however, by the time the Lambda has extracted the bucket and file name of the deleted file, the file is already gone from S3, so I do not have access to its contents.
What approach should be taken with AWS Lambda to get the contents of a file after it is deleted from an S3 bucket?
The comment proposed by @keithRozario was useful; however, with versioning, a GET request for the deleted object results in a not found error, as per the S3 documentation.
@Ersoy's suggestion was to create a 'bin' bucket or directory/prefix, keep a copy under the same file name there, and work with that copy as per your requirements.
In my case, I copy each object to a bin directory when it is first created, and then read from that directory when the file is deleted from the main upload directory.
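A minimal sketch of that approach (bucket, prefix, and event wiring are assumptions, not the exact production setup): copy each new object into a bin/ prefix from an s3:ObjectCreated trigger, then read the surviving copy when the s3:ObjectRemoved trigger fires.

    # Sketch of the "bin copy" approach; the bin prefix is hypothetical
    # and both handlers assume the standard S3 event record structure.
    import boto3
    from urllib.parse import unquote_plus

    s3 = boto3.client("s3")
    BIN_PREFIX = "bin/"  # hypothetical prefix holding the safety copies

    def on_object_created(event, context):
        """Triggered by s3:ObjectCreated:* -- keep a copy under bin/."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            if key.startswith(BIN_PREFIX):
                continue  # don't copy the copies
            s3.copy_object(
                Bucket=bucket,
                Key=BIN_PREFIX + key,
                CopySource={"Bucket": bucket, "Key": key},
            )

    def on_object_removed(event, context):
        """Triggered by s3:ObjectRemoved:* -- read the surviving copy."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            response = s3.get_object(Bucket=bucket, Key=BIN_PREFIX + key)
            body = response["Body"].read()
            # ... process the deleted file's contents here ...
            print(f"Recovered {len(body)} bytes for deleted key {key}")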
I have a working config to push files from a directory on my server to an S3 bucket. NiFi is running on a different server, so I have a GetSFTP. The source files live in subfolders that my current PutS3Object config does not preserve; it jams all of the files at the root level of the S3 bucket. I know there's a way to get PutS3Object to create directories using defined folders. The Object Key is set to ${filename} by default. If set to, say, my/directory/${filename}, it creates two folders, my and the subfolder directory, and puts the files inside. However, I do NOT know what to set for the Object Key to replicate the files' source directories.
Try ${path}/${filename} based on this in the documentation:
Keeping with the example of a file that is picked up from a local file system, the FlowFile would have an attribute called filename that reflected the name of the file on the file system. Additionally, the FlowFile will have a path attribute that reflects the directory on the file system that this file lived in.
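As a rough illustration (the paths and attribute values here are made up, and the exact value of path can depend on how the SFTP processor is configured): if GetSFTP picks up /uploads/reports/2023/data.csv, the flowfile should arrive with something like

    filename = data.csv
    path     = reports/2023

so an Object Key of ${path}/${filename} writes the object as reports/2023/data.csv, mirroring the source subfolders in the bucket.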
When using Redshift Spectrum, it seems you can only provide a LOCATION down to a folder, and it then imports all the files inside that folder.
Is there a way to import only one file from inside a folder with many files? When I provide the full path including the filename, it seems to treat the file as a manifest file and gives errors such as "manifest is too large" or "JSON not supported".
Is there any other way?
You inadvertently answered your own question: Use a manifest file
From CREATE EXTERNAL TABLE - Amazon Redshift:
LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' }
The path to the Amazon S3 bucket or folder that contains the data files or a manifest file that contains a list of Amazon S3 object paths. The buckets must be in the same AWS Region as the Amazon Redshift cluster.
If the path specifies a manifest file, the s3://bucket/manifest_file argument must explicitly reference a single file, for example 's3://mybucket/manifest.txt'. It can't reference a key prefix.
The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3 and the size of the file, in bytes. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same AWS Region as the Amazon Redshift cluster.
I'm not sure why it requires the length of each file. It might be used to distribute the workload amongst multiple nodes.
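If it helps, here is a small boto3 sketch that builds a single-file manifest of the shape described above (the bucket, key, and manifest name are hypothetical, and the exact manifest fields should be double-checked against the Redshift documentation):

    # Sketch: build a one-file manifest and upload it, then point
    # LOCATION at the manifest. Bucket, key and manifest name are
    # hypothetical.
    import json
    import boto3

    s3 = boto3.client("s3")
    bucket = "mybucket"                        # hypothetical
    key = "folder/only-this-file.parquet"      # hypothetical single file

    # HeadObject gives the size in bytes that the manifest needs.
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    manifest = {
        "entries": [
            {
                "url": f"s3://{bucket}/{key}",
                "mandatory": True,
                "meta": {"content_length": size},
            }
        ]
    }

    # Then use LOCATION 's3://mybucket/manifest.txt' in CREATE EXTERNAL TABLE.
    s3.put_object(
        Bucket=bucket,
        Key="manifest.txt",
        Body=json.dumps(manifest).encode("utf-8"),
    )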
I'm looking for a way to replicate objects between S3 buckets across regions.
The purpose is that if a file is accidentally deleted because of a bug in my application, I would be able to restore it from the other bucket.
Is there any way to do this without uploading the file twice (meaning, not in the application layer)?
Enable versioning on your S3 bucket. After that, the bucket keeps every version of the files you upload or update, and you can restore any version of a file from the version listing. See Amazon S3 Object Lifecycle Management.
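If it needs to be automated, versioning can also be enabled from code; a minimal boto3 sketch (the bucket name and key prefix are hypothetical):

    # Sketch: enable versioning so overwrites and deletes keep previous
    # versions recoverable. Bucket name and prefix are hypothetical.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket="my-bucket",  # hypothetical bucket name
        VersioningConfiguration={"Status": "Enabled"},
    )

    # A delete then only adds a delete marker; listing object versions
    # shows the older versions, which can be retrieved by VersionId.
    versions = s3.list_object_versions(Bucket="my-bucket", Prefix="some/key")
    for v in versions.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])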