I have the following requirement, but I'm unsure how to go about it:
Bucket 1 contains data.
Bucket 2 should hold a duplicate of the data in Bucket 1. Whenever any file is changed in bucket 1, it should also be changed in bucket 2.
Data in bucket 2 can be independently changed. However, this data change should not be reflected in bucket 1.
This entire process must be automated and run in real time.
Depending on your needs, you might find Cross Region Replication works for you. This would require the buckets to be in separate regions. It also wouldn't copy items that were replicated from another bucket.
Essentially you just create two buckets in separate regions, create an IAM role allowing the replication, then create a Replication Configuration.
If you already have data in the source bucket that you want to appear in the target bucket, then you will also need to run a sync (you can do this as a one-off via the CLI).
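As a minimal sketch (the bucket names and role ARN below are placeholders), the replication configuration could be applied with boto3; note that versioning has to be enabled on both buckets first:

import boto3

s3 = boto3.client("s3")

# Replication requires versioning on both the source and destination buckets.
for bucket in ("bucket-1", "bucket-2"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate everything from bucket-1 to bucket-2 via a pre-created IAM role.
s3.put_bucket_replication(
    Bucket="bucket-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-all",
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {"Bucket": "arn:aws:s3:::bucket-2"},
            }
        ],
    },
)

The one-off sync of pre-existing data can then be done from the CLI with aws s3 sync s3://bucket-1 s3://bucket-2.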
Another option is using AWS Lambda, which allows the buckets to be in the same region, and gives you more control should you need it. You can also replicate to multiple buckets if you want to.
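If you go the Lambda route, a rough sketch of the copy function might look like this (the destination bucket name is a placeholder, and the function is assumed to be triggered by s3:ObjectCreated:* events on bucket 1):

import urllib.parse
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "bucket-2"  # placeholder destination bucket


def handler(event, context):
    # Invoked by an s3:ObjectCreated:* notification on the source bucket.
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the new or updated object into the second bucket.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )

Because the notification is only attached to bucket 1, changes made directly in bucket 2 never flow back, which matches the one-way requirement.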
I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the following two points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of network transfer time?
Thanks in advance
The primary advantages of creating an external stage are that you can tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to make that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for Dev-to-Prod migrations as well.
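As a minimal sketch of how that looks in practice, assuming hypothetical names (my_s3_stage, my_table, an s3://my-bucket/data/ path) and placeholder credentials, run for example via the Python connector:

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# The stage records the location, credentials and file format once...
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/data/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
""")

# ...so every load only needs to reference the stage name.
cur.execute("COPY INTO my_table FROM @my_s3_stage")

Without the stage, each COPY INTO would have to repeat the s3:// URL, credentials and file format inline.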
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.
I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences between the files in my bucket on two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to create rules in the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
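If you do want to list it yourself, a rough sketch with boto3 (the bucket name and cut-off timestamp are placeholders) could look like this:

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
last_load = datetime(2018, 9, 10, tzinfo=timezone.utc)  # placeholder cut-off

new_bytes = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):  # placeholder bucket name
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_load:
            new_bytes += obj["Size"]
            print(obj["Key"], obj["Size"], obj["LastModified"])

print("Bytes uploaded since last load:", new_bytes)

As noted above, this only sees what is currently in the bucket; deletions are invisible unless you keep the previous listing.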
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
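If you take the event-driven route, the notification wiring could be set up roughly like this (the bucket name and Lambda ARN are placeholders, and the function must already permit S3 to invoke it):

import boto3

s3 = boto3.client("s3")

# Send every object-created event on the bucket to a Lambda function.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:track-uploads",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)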
When I run the following:
aws s3 mb s3://toto-pillar-itg-test-export-8 --region eu-west-1
I get:
make_bucket failed: s3://toto-pillar-itg-test-export-8 An error occurred (BucketAlreadyExists) when calling the CreateBucket operation: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
But, after, when I run the following:
aws s3 mb s3://toto-pillar-itg-test-export-8 --region us-east-1
It works well.
I don't understand why I can't create a bucket in the eu-west-1 region.
It's not entirely clear what operations you may have attempted, in what order, but here are some thoughts to consider:
You can't have more than one bucket with the same name, regardless of region.
No two AWS accounts can simultaneously have a bucket with the same name, regardless of region.
After creating a bucket, then deleting the bucket, there is a documented but unspecified period of time that must elapse before you -- or anyone else -- can create another bucket with the same name.
The us-east-1 region is the authoritative keeper of the global list of unique bucket names. The other regions only have a copy, so us-east-1 could be expected to be aware of the deletion of a bucket sooner than any other region, making the wait time there shorter than the wait time elsewhere.
The timing may also vary depending on whether create follows delete in the same region or a different region, or by the same account or a different account, but the contribution to the delay by these factors, if any, is not documented.
Clearly, at one point, the eu-west-1 region believed the bucket existed, as evidenced by BucketAlreadyExists, while us-east-1 did not. It may have been a coincidence of the timing of your requests, but the evidence so far suggests that before you tried any of these commands, this bucket had recently been deleted. If that is the case, this is expected behavior, and would eventually resolve itself.
After a bucket is deleted, the name becomes available to reuse, but the name might not be available for you to reuse for various reasons. For example, some other account could create a bucket with that name. Note, too, that it might take some time before the name can be reused. So if you want to use the same bucket name, don't delete the bucket. (emphasis added)
https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
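As a rough way to probe whether a name is currently taken anywhere in the global namespace, you can call head_bucket and inspect the error code (the bucket name below is the one from the question):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def bucket_name_status(name):
    # Rough probe of the global bucket namespace.
    try:
        s3.head_bucket(Bucket=name)
        return "exists and is accessible to this account"
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "404":
            return "not found - may be available (or still settling after a delete)"
        if code == "403":
            return "exists but belongs to another account (or access is denied)"
        return "unexpected error: " + code


print(bucket_name_status("toto-pillar-itg-test-export-8"))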
I've created an AWS Glue table based on the contents of an S3 bucket. This allows me to query the data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all data newly uploaded into the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something?
A crawler is needed to register new data partitions in the Data Catalog. For example, say your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative solution is to run 'MSCK REPAIR TABLE {table-name}' in Athena.
Besides that, a crawler can detect a change in schema and take appropriate actions depending on your configuration.
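If you would rather not schedule the crawler, the MSCK REPAIR TABLE alternative can itself be automated, for example from a scheduled script or Lambda (the database, table and results-bucket names below are placeholders):

import boto3

athena = boto3.client("athena")

# Ask Athena to discover and register any new Hive-style partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)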
I'm setting up my client with a system that allows users to upload a video or two. These videos will be stored on Amazon S3, which I've not used before. I'm unsure about buckets, and what they represent. Do you think I would have a single bucket for my application, a bucket per user or a bucket per file?
If I were to just have the one bucket, presumably I'd have to have really long, illogical file names to prevent a file name clash.
There is no limit to the number of objects you can store in a bucket, so generally you would have a single bucket per application, or even one shared across multiple applications. Bucket names have to be globally unique across S3, so it would certainly be impossible to manage a bucket per object. A bucket per user would also be difficult if you had more than a handful of users.
For more background on buckets you can try reading Working with Amazon S3 Buckets
Your application should generate unique keys for the objects you add to the bucket. Try to avoid ascending numeric IDs, as these are considered inefficient. Simply reversing a numeric ID can usually make an effective object key. See Amazon S3 Performance Tips & Tricks for a more detailed explanation.
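One illustrative way to generate collision-free keys inside a single bucket (the bucket name and key layout here are just examples):

import uuid
import boto3

s3 = boto3.client("s3")


def upload_video(file_path, user_id):
    # A random UUID component keeps keys unique without sequential prefixes;
    # the user id is included purely for readability when browsing the bucket.
    filename = file_path.rsplit("/", 1)[-1]
    key = f"videos/{user_id}/{uuid.uuid4()}-{filename}"
    s3.upload_file(file_path, "my-app-uploads", key)  # placeholder bucket name
    return key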