Exclude specific files from S3 Cross-Region Replication

I was wondering if there was a way to exclude specific files from S3 Cross-Region Replication. I am aware of the prefix option, but I have a cache folder within my bucket that I don't want to include.
Example:
I want to include the following:
images/production/image1/file.jpg
But I don't want to include this:
images/production/image1/cache/file.jpg

It seems you need to play with object/bucket permissions in order to exclude certain objects from replication:
Amazon S3 will replicate only objects in the source bucket for which the bucket owner has permission to read objects and read ACLs
and
Amazon S3 will not replicate objects in the source bucket for which the bucket owner does not have permissions
It might be easier to move the cache data into a separate bucket.

I know it's an old post but I thought it might be worth updating it with an answer that does not require meddling with the permissions.
According to Amazon's own documentation (https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-how-setup.html) you can choose the objects (using a prefix in the object name or filtering by tags) that will be replicated in the Replication Configuration for the bucket:
The objects that you want to replicate—You can replicate all of the objects in the source bucket or a subset. You identify a subset by providing a key name prefix, one or more object tags, or both in the configuration. For example, if you configure cross-region replication to replicate only objects with the key name prefix Tax/, Amazon S3 replicates objects with keys such as Tax/doc1 or Tax/doc2, but not an object with the key Legal/doc3. If you specify both a prefix and one or more tags, Amazon S3 replicates only objects having the specific key name prefix and the tags.
For instance, to use a prefix, set the following rule in your CRR configuration (https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-add-config.html):
<Rule>
  ...
  <Filter>
    <Prefix>key-prefix</Prefix>
  </Filter>
</Rule>
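For reference, here is a minimal boto3 (Python) sketch of applying such a rule; the bucket names, role ARN, and prefix are placeholders, and a tag-based Filter would be the alternative if a prefix alone cannot carve out what you need:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket names, role ARN, and prefix.
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [
            {
                "ID": "replicate-prefix-only",
                "Status": "Enabled",
                "Priority": 1,
                # Only keys beginning with this prefix are replicated.
                # A prefix can only include a subtree, so to skip a nested
                # cache/ folder you would instead tag the objects you do
                # want replicated and use a Tag filter here.
                "Filter": {"Prefix": "images/production/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
            }
        ],
    },
)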


terraform reference existing s3 bucket and dynamo table

From my Terraform script, I am trying to get hold of data for existing resources, such as the ARN of an existing DynamoDB table and the bucket ID for an existing S3 bucket. I've tried to use terraform_remote_state for S3, however it doesn't fit my requirements as it requires a key, and I haven't found anything yet that would work for Dynamo.
Is there a solution that would work for both, or would there be two separate solutions?
Many thanks in advance.
Remote state is not the concept you need - that's for storage of the tfstate file. What you require is a "data source":
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/s3_bucket
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/dynamodb_table
In Terraform, you use "Resources" to declare what things need to be created (if they don't exist), and "Data Sources" to read information from things that already exist and are not managed by Terraform.
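A minimal sketch of what that looks like in HCL (the bucket and table names are placeholders):

# Read information about resources that already exist and are not managed here.
data "aws_s3_bucket" "existing" {
  bucket = "my-existing-bucket"   # placeholder name
}

data "aws_dynamodb_table" "existing" {
  name = "my-existing-table"      # placeholder name
}

# The attributes can then be referenced elsewhere, e.g.:
#   data.aws_s3_bucket.existing.id
#   data.aws_dynamodb_table.existing.arn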

Are s3 prefixes deleted by the lifecycle management?

Are (empty) prefixes also deleted by the s3 lifecycle management?
S3 is a glorious hash table of (key, value) pairs. The presence of '/' in the key gives the illusion of folder structure and the S3 web UI also organizes the keys in a hierarchy. So, if lifecycle management rules end up deleting all the keys with a certain prefix, then it essentially means the prefix is also deleted (basically, there is no key with such a prefix). HTH.
Short answer: yes
More detail: the "folders" you see are just 0-byte objects. When you use the Amazon S3 console to create a folder, Amazon S3 creates a 0-byte object with a key that's set to the folder name that you provided. For example, if you create a folder named photos in your bucket, the Amazon S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders. (See: How S3 supports folder idea.)
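To illustrate, a lifecycle rule is itself defined against a key prefix, so it only ever acts on keys; a rough boto3 (Python) sketch with placeholder bucket and prefix names:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix: expire everything under photos/ after 30 days.
# The rule matches keys by prefix; once the last matching key is gone,
# the "folder" no longer appears, because it was never a separate entity.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-photos",
                "Status": "Enabled",
                "Filter": {"Prefix": "photos/"},
                "Expiration": {"Days": 30},
            }
        ]
    },
)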

aws s3 search sub-folders containing one specific file

I understand that S3 does not have "folders", but I will still use the term to illustrate what I am looking for.
I have this folder structure in s3:
my-bucket/folder-1/file-named-a
my-bucket/folder-2/...
my-bucket/folder-3/file-named-a
my-bucket/folder-4/...
I would like to find all folders containing "file-named-a", so folder-1 and folder-3 in the above example would be returned. I only need to search the "top-level" folders under my-bucket. There could be tens of thousands of folders to search. How do I construct the ListObjectsRequest to do that?
Thanks,
Sam
An Amazon S3 bucket can be listed (ListBucket()) to view its contents, and this API call can be limited by a Prefix. However, it is not possible to put a wildcard within the prefix.
Therefore, you would need to retrieve the entire bucket listing, looking for these files. This would require repeated calls if there are a large number of objects.
Example: Listing Keys Using the AWS SDK for Java
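The link above shows the Java SDK version; the same idea sketched with boto3 (Python) instead, assuming the bucket layout from the question:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

folders = set()
# Paginate through the whole bucket; there is no server-side wildcard,
# so we filter client-side for keys like "<top-level-folder>/file-named-a".
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        parts = obj["Key"].split("/")
        if len(parts) == 2 and parts[1] == "file-named-a":
            folders.add(parts[0])

print(sorted(folders))   # e.g. ['folder-1', 'folder-3']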

How to upload multiple files to google cloud storage bucket as a transaction

Use Case:
Upload multiple files into a cloud storage bucket, and then use that data as the source for a BigQuery import. Use the name of the bucket as the metadata to drive which sharded table the data should go into.
Question:
In order to prevent a partial import to the BigQuery table, ideally I would like to do the following:
Upload the files into a staging bucket
Verify all files have been uploaded correctly
Rename the staging bucket to its final name (for example, gs://20130112)
Trigger the BigQuery import to load the bucket into a sharded table
Since gsutil does not seem to support bucket rename, what are the alternative ways to accomplish this?
Google Cloud Storage does not support renaming buckets, or more generally an atomic way to operate on more than one object at a time.
If your main concern is that all objects were uploaded correctly (as opposed to needing to ensure the bucket content is only visible once all objects are uploaded), gsutil cp supports that -- if any object fails to upload, it will report the number that failed to upload and exit with a non-zero status.
So, a possible implementation would be a script that runs gsutil cp to upload all your files, and then checks the gsutil exit status before creating the BigQuery table load job.
Mike Schwartz, Google Cloud Storage team
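A rough sketch of such a script in Python (the bucket, source files, table name, and the gsutil/bq flags are placeholders that depend on your data format):

import subprocess
import sys

# Placeholders: adjust the bucket, source files, and BigQuery table to your setup.
upload = subprocess.run(
    ["gsutil", "-m", "cp", "-r", "local-dir/", "gs://my-staging-bucket/"],
)
if upload.returncode != 0:
    # gsutil reports how many files failed and exits non-zero; bail out
    # so we never trigger a partial BigQuery import.
    sys.exit("upload failed, not starting BigQuery load")

subprocess.run(
    ["bq", "load", "--source_format=CSV",
     "my_dataset.my_table_20130112", "gs://my-staging-bucket/*.csv"],
    check=True,
)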
Object names are actually flat in Google Cloud Storage; from the service's perspective, '/' is just another character in the name. The folder abstraction is provided by clients, like gsutil and various GUI tools. Renaming a folder requires clients to request a sequence of copy and delete operations on each object in the folder. There is no atomic way to rename a folder.
Mike Schwartz, Google Cloud Storage team

AWS: Append only mode for S3 bucket

Context
I want to have a machine upload a file dump.rdb to s3/blahblahblah/YEAR-MONTH-DAY-HOUR.rdb on the hour.
Thus, I need this machine to have the ability to upload new files to S3.
However, I don't want this machine to have the ability to (1) delete existing files or (2) overwrite existing files.
In a certain sense, it can only "append" -- it can only add in new objects.
Question:
Is there a way to configure an S3 setup like this?
Thanks!
I cannot comment yet, so here is a refinement to @Viccari's answer...
The answer is misleading because it only addresses requirement (1), not (2). In fact, it appears that it is not possible to prevent overwriting existing files using either method, although you can enable versioning. See here: Amazon S3 ACL for read-only and write-once access.
Because you add a timestamp to your file names, you have more or less worked around the problem. (The same would be true of other schemes that encode the "version" of each file in the file name: timestamps, UUIDs, hashes.) However, note that you are not truly protected: a bug in your code, or two uploads in the same hour, would still result in an overwritten file.
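If you do lean on versioning as the safety net mentioned above, enabling it is a single call; a boto3 (Python) sketch with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
# With versioning on, an accidental overwrite creates a new version
# instead of destroying the previous object.
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)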
Yes, it is possible.
There are two ways to add permissions to a bucket and its contents: Bucket policies and Bucket ACLs. You can achieve what you want by using bucket policies. On the other hand, Bucket ACLs do not allow you to give "create" permission without giving "delete" permission as well.
1-Bucket Policies:
You can create a bucket policy (see some common examples here), allowing, for example, a specific IP address to have specific permissions.
For example, you can allow s3:PutObject and not allow s3:DeleteObject.
More on S3 actions in bucket policies can be found here.
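For illustration, a rough boto3 (Python) sketch of such a policy; the bucket name, the IP condition, and the statement ID are placeholders, and the point is that only s3:PutObject is allowed, with no s3:DeleteObject:

import json
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and IP; grants PutObject only, no DeleteObject.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUploadOnly",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.10/32"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))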
2-Bucket ACLs:
Using Bucket ACLs, you can only grant the complete "write" permission, i.e. if a given user is able to add a file, they are also able to delete files.
This is NOT possible! S3 is a key/value store and thus inherently doesn't support append-only semantics. A PUT (or cp) to S3 can always overwrite a file. By enabling versioning on your bucket, you are still safe in case the account uploading the files gets compromised.