We have a process uploading files to S3, though indirectly: we use Amazon Elastic MapReduce (EMR), and Hadoop commits the files to S3 from many different task nodes. Then, after that Hadoop job has completed successfully, another part of the process uses Hadoop's FileSystem.createNewFile() to create some files from the master node.
The files created from these various machines have timestamps in S3. We assume the timestamps of the files committed from the task nodes are earlier than those of the files created from the master node.
I believe that is sometimes untrue, but why?
What assigns the timestamp to an S3 file? Is it the Amazon EMR Hadoop client, or some S3 machine?
If I have two machines uploading to S3 whose local clock differs by 30 minutes, will the timestamps be 30 minutes apart?
You are unable to set the Last-Modified values yourself. S3 decides them:
https://forums.aws.amazon.com/thread.jspa?messageID=209241
The only timestamp in S3 appears to be the "Last Modified" metadata. I believe the last-modified date/time is set by the S3 system itself and reflects the time when the file finished uploading fully to S3 (S3 will not show incomplete transfers).
So it shouldn't matter which node you upload a file from; the Last-Modified timestamp you see when listing the object is assigned consistently by S3 itself.
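If you want to verify this yourself, here is a minimal sketch using boto3 that reads the Last-Modified values S3 assigned to two objects; the bucket and key names are only placeholders:

```python
# Sketch: compare the server-assigned Last-Modified timestamps of two
# objects. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

task_output = s3.head_object(Bucket="my-bucket", Key="job/part-00000")
master_output = s3.head_object(Bucket="my-bucket", Key="job/_manifest")

# Both timestamps come from S3, not from the uploading machines' clocks.
print("task node file  :", task_output["LastModified"])
print("master node file:", master_output["LastModified"])
```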
I need to copy 1 TB of data from a Yandex bucket to an S3 bucket: a first run for full replication, then a daily run twice a day (every 12 hours) so that all new files are also synced to the S3 bucket. I have explored solutions like rclone and Flexify, but I am unsure how to proceed. What would be the most optimal and cost-effective solution to this problem?
When we run a "COPY INTO from AWS S3 Location" command, do the data files physically get copied from S3 to EC2 VM storage (SSD/RAM)? Or does the data still reside on S3 and get converted to Snowflake format?
And, if I run COPY INTO and then suspend the warehouse, would I lose data on resumption?
Please let me know if you need any other information.
The data is loaded into Snowflake tables from an external location like S3. The files remain on S3; if you need to remove them after the copy operation, you can use the "PURGE=TRUE" parameter along with the "COPY INTO" command.
The files as such stay in the S3 location; only the values from them are copied into the tables in Snowflake.
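For illustration, a minimal sketch using the Snowflake Python connector; the connection parameters, table name, and stage name are placeholders:

```python
# Sketch: run COPY INTO with PURGE = TRUE so the staged S3 files are
# removed after a successful load. Names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
try:
    cur = conn.cursor()
    # @my_s3_stage is assumed to be an external stage pointing at the S3 location.
    cur.execute("COPY INTO my_table FROM @my_s3_stage PURGE = TRUE")
    print(cur.fetchall())  # per-file load results
finally:
    conn.close()
```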
Warehouse operations that are already running are not affected when the warehouse is shut down; they are allowed to complete. So there is no data loss in that event.
When we run a "COPY INTO from AWS S3 Location" command, Snowflake copies the data files from your S3 location into Snowflake's own S3 storage. That Snowflake-managed storage is only accessible by querying the table into which you loaded the data.
When you suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse, but allows any compute resources that are executing statements to continue until the statements complete, at which time the resources are shut down and the status of the warehouse changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.
More details: https://docs.snowflake.com/en/user-guide/warehouses-tasks.html#suspending-a-warehouse
Details on the loading mechanism you are using are in docs: https://docs.snowflake.com/en/user-guide/data-load-s3.html#bulk-loading-from-amazon-s3
I am using an AWS EMR compute cluster (version 5.27.0), which uses S3 for data persistence.
This cluster both reads and writes to S3.
S3 is eventually consistent, so newly written data cannot always be listed immediately. Because of this, I use EMRFS with DynamoDB to store newly written paths for immediate listing.
The problem now is that I have to set a retention policy on S3, which will delete data more than a month old from S3. However, the corresponding entries do not get deleted from the EMRFS DynamoDB table, leading to consistency issues.
My question is: how can I ensure that, when the retention policy deletes objects from S3, the same paths get deleted from the DynamoDB table?
One naive solution I have come up with is to define a Lambda that fires periodically and manually sets a TTL of, say, 1 day on the DynamoDB records. Is there a better approach?
You can configure DynamoDB with the same expiration policy as your S3 objects have:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
In this case, you ensure DynamoDB and S3 both contain the same set of existing objects.
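As a minimal boto3 sketch, assuming the default EMRFS metadata table name and a hypothetical epoch-seconds attribute called expireAt (your items must actually carry that attribute for DynamoDB to expire them):

```python
# Sketch: enable DynamoDB TTL on the EMRFS metadata table so items expire
# roughly in step with the S3 lifecycle policy. Table name and attribute
# name are placeholders; items must contain the epoch-seconds attribute.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_time_to_live(
    TableName="EmrFSMetadata",        # placeholder: your EMRFS metadata table
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expireAt",  # hypothetical epoch-seconds attribute
    },
)
```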
I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature I can use to compare the differences in my bucket's files between two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to create rules in the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
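For example, a minimal boto3 sketch that enables a daily CSV inventory; all bucket names, the configuration Id, and the prefix are placeholders:

```python
# Sketch: enable daily S3 Inventory for a bucket, delivered to another
# bucket as CSV. All names and the configuration Id are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="my-data-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-bucket",
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    },
)
```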
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
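As a sketch of the do-it-yourself approach with boto3, assuming a placeholder bucket name and a "last run" timestamp that you persist somewhere between runs:

```python
# Sketch: list a bucket and keep only objects modified since the last run.
# Bucket name and the "last run" timestamp are placeholders; persist the
# timestamp (or the full key list) somewhere between runs.
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder

new_objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_run:
            new_objects.append(obj["Key"])

print(f"{len(new_objects)} objects added or changed since {last_run}")
```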
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
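As a minimal sketch of such a Lambda handler (what you do with each new object is up to you), reading the bucket and key out of the S3 event:

```python
# Sketch of a Lambda handler triggered by S3 "ObjectCreated" events.
# What you do with each new object (process it, forward it, log it)
# is up to you; here it just prints the bucket, key, and size.
def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size")
        print(f"New object: s3://{bucket}/{key} ({size} bytes)")
```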
I've created an AWS glue table based on contents of a S3 bucket. This allows me to query data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all data newly uploaded into the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something like that?
A crawler is needed to register new data partitions in the Data Catalog. For example, say your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative is to run 'MSCK REPAIR TABLE {table-name}' in Athena.
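For example, a minimal boto3 sketch that runs the repair statement through Athena instead of scheduling a crawler; the database, table, and query-result location are placeholders:

```python
# Sketch: register newly arrived partitions by running MSCK REPAIR TABLE
# through Athena. Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```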
Besides that, a crawler can detect a change in schema and take appropriate actions depending on your configuration.