COPY INTO <Snowflake table> from AWS S3

When we run a "COPY INTO from AWS S3 Location" command, do the data files physically get copied from S3 to EC2 VM storage (SSD/RAM)? Or does the data still reside on S3 and get converted to Snowflake format?
And if I run COPY INTO and then suspend the warehouse, would I lose data on resumption?
Please let me know if you need any other information.

The data is loaded into Snowflake tables from an external location such as S3. The files remain on S3; if you need to remove them after the copy operation, you can use the "PURGE = TRUE" parameter along with the "COPY INTO" command.
The files as such stay in the S3 location; only the values from them are copied into the tables in Snowflake.
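As a rough sketch of what that looks like when run through the Python connector (connection details, table, stage and file-format names here are invented placeholders, not your objects):
# Hypothetical example: load staged S3 files into a table and delete them from the stage afterwards.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="LOAD_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()
# PURGE = TRUE removes the source files from the S3 stage once they have loaded successfully.
cur.execute("""
    COPY INTO my_table
    FROM @my_s3_stage/path/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    PURGE = TRUE
""")
print(cur.fetchall())   # one row per file: load status, rows parsed/loaded, errors
cur.close()
conn.close()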
Warehouse operations that are already running are not affected even if the warehouse is shut down; they are allowed to complete. So there is no data loss in that event.

When we run a "COPY INTO from AWS S3 Location" command, Snowflake copies the data files from your S3 location into Snowflake's internal S3 storage. That internal location is only accessible by querying the table into which you have loaded the data.
When you suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse, but allows any compute resources that are executing statements to continue until the statements complete, at which time the resources are shut down and the status of the warehouse changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.
More details: https://docs.snowflake.com/en/user-guide/warehouses-tasks.html#suspending-a-warehouse
Details on the loading mechanism you are using are in docs: https://docs.snowflake.com/en/user-guide/data-load-s3.html#bulk-loading-from-amazon-s3
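If you want to convince yourself nothing is lost across a suspend/resume, one way (sketched below with the Python connector; warehouse, database and table names are placeholders, and it assumes the warehouse is currently running) is to suspend after the load and then check the table count or the COPY_HISTORY table function, which reports what each COPY INTO actually loaded:
# Hypothetical check that loaded rows persist after suspending the warehouse.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("USE WAREHOUSE LOAD_WH")
cur.execute("USE SCHEMA MY_DB.PUBLIC")

# Suspend; any statement still executing is allowed to finish first ("quiesce").
cur.execute("ALTER WAREHOUSE LOAD_WH SUSPEND")

# Resuming and querying shows the loaded data is still there, because it lives in
# Snowflake's table storage, not in the warehouse's local SSD/RAM cache.
cur.execute("ALTER WAREHOUSE LOAD_WH RESUME")
cur.execute("SELECT COUNT(*) FROM my_table")
print(cur.fetchone())

# COPY_HISTORY reports per-file load results for recent COPY INTO commands.
cur.execute("""
    SELECT file_name, status, row_count
    FROM TABLE(information_schema.copy_history(
        table_name => 'MY_TABLE',
        start_time => DATEADD(hour, -1, CURRENT_TIMESTAMP())))
""")
print(cur.fetchall())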

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the two following points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of the network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage, making that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for Dev-to-Prod migrations as well.
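For illustration, a minimal sketch of that difference through the Python connector (bucket URL, credentials and object names are invented for the example): with a stage, the location, credentials and file format live on the stage, and every COPY INTO stays short.
# Hypothetical example; all names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("USE SCHEMA MY_DB.PUBLIC")

# One-time setup: the stage carries the location, credentials and file format.
cur.execute("CREATE FILE FORMAT IF NOT EXISTS my_csv_format TYPE = CSV SKIP_HEADER = 1")
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_s3_stage
      URL = 's3://my-bucket/exports/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")

# Every load afterwards only needs the stage name; if the bucket moves,
# you recreate the stage and none of the COPY statements change.
cur.execute("COPY INTO my_table FROM @my_s3_stage/2023/01/")

# Without a stage you would inline everything on each COPY INTO instead:
# COPY INTO my_table FROM 's3://my-bucket/exports/2023/01/'
#   CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
#   FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)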
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another; some of the files are up to 2GB in size. I used the AWS CLI's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and, on verifying the files that had been transferred, it became clear that the large files had lost the metadata that was present on the original files in the original bucket.
This is a "known" issue with larger files, where s3 switches to multipart upload to handle the transfer.
This multipart handling can be configured via the .aws/config file, which has been done like so:
[default]
s3 =
  multipart_threshold = 4500MB
However, when testing the transfer again, the metadata on the larger files is still not present. It is present on all of the smaller files, so it's clear that I'm hitting the multipart upload issue.
Given this is an S3-to-S3 transfer, is the local s3 configuration taken into consideration at all?
As an alternative, is there a way to just sync the metadata now that all the files have been transferred?
I have also tried aws s3 cp, with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will be copied between the buckets. You can trigger the copy for existing objects by copying them onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
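A minimal boto3 sketch of that last approach, using the bucket names and prefix from your sync command (the logic of "missing metadata means it came over as multipart" is an assumption to adapt):
# Hypothetical repair script: find destination objects missing metadata and
# copy them onto themselves with the source object's metadata.
import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "bucket"
DST_BUCKET = "destination-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=DST_BUCKET, Prefix="folder/folder/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        dst_meta = s3.head_object(Bucket=DST_BUCKET, Key=key).get("Metadata", {})
        if dst_meta:            # metadata survived the sync, nothing to fix
            continue
        src_meta = s3.head_object(Bucket=SRC_BUCKET, Key=key).get("Metadata", {})
        if not src_meta:
            continue
        # Copy the object onto itself, re-attaching the source metadata.
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=key,
            CopySource={"Bucket": DST_BUCKET, "Key": key},
            Metadata=src_meta,
            MetadataDirective="REPLACE",
        )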
I finally managed to implement a solution for this and took the opportunity to play around with the Serverless framework and Step Functions.
The general flow I went with was:
A Step Function triggered using a CloudWatch Event Rule targeting S3 events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transferred using a multipart process.
The initial Task on the Step Function checks if all the required MetaData is present on the object that raised the event.
If it is present then the Step Function is finished
If it is not present then the second lambda task is fired which copies all metadata from the source object to the destination object.
This could be achieved without Step Functions; however, it was a good, simple exercise to give them a go. The first 'Check Meta' task is actually redundant, as the metadata is never present if multipart transfer is used; I was originally also triggering off of PutObject and CopyObject, which is why I had the Check Meta task.

What's the use of periodically scheduling an AWS Glue crawler? Running it once seems to be enough

I've created an AWS glue table based on contents of a S3 bucket. This allows me to query data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all data newly uploaded into the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e.: schedule) an AWS Glue Crawler? After all, as said, updates to the s3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the queryplanner can be optimized or something?
A crawler is needed to register new data partitions in the Data Catalog. For example, say your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative solution is to run 'MSCK REPAIR TABLE {table-name}' in Athena.
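For reference, a small sketch of those two alternatives driven from Python via boto3 (the database, table, result bucket and partition location are placeholders); the crawler essentially automates what these statements do by hand:
# Hypothetical example: register new partitions without a crawler, via Athena.
import boto3

athena = boto3.client("athena")

def run_athena(sql: str) -> str:
    """Submit a query to Athena and return its execution id."""
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

# Option 1: let Athena scan the table location and register any new Hive-style
# partitions (year=.../month=.../day=...) it finds.
run_athena("MSCK REPAIR TABLE my_table")

# Option 2: add one known partition explicitly.
run_athena("""
    ALTER TABLE my_table ADD IF NOT EXISTS
    PARTITION (year=2018, month=9, day=12)
    LOCATION 's3://my-bucket/data/year=2018/month=9/day=12/'
""")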
Besides that, a crawler can detect a change in schema and take appropriate actions depending on your configuration.

Stream data from S3 bucket to redshift periodically

I have some data stored in S3. I need to clone/copy this data periodically from S3 to a Redshift cluster. To do a bulk copy, I can use the COPY command to copy from S3 to Redshift.
Similarly, is there any trivial way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline which has various templates for moving data from one AWS service to other. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule; there is a cron-style editor for scheduling.
AWS Lambda Redshift Loader is a good solution that runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Firehose is the simplest way to get this done. Simply create a Kinesis Firehose stream, point it at a specific table in your Redshift cluster, write data to the stream, done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
The Kinesis option works only if Redshift is publicly accessible.
You can use the COPY command with Lambda. You can configure two Lambdas: one creates a manifest file for your incoming new data, and the other reads from that manifest and loads the data into Redshift with the Redshift Data API.
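A rough sketch of the second of those two Lambdas, using the Redshift Data API via boto3 (cluster, database, role, table and manifest paths are placeholders for the example):
# Hypothetical Lambda: run a COPY from a manifest file via the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data")

COPY_SQL = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/manifests/latest.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    MANIFEST
    FORMAT AS CSV;
"""

def handler(event, context):
    # execute_statement is asynchronous; the COPY keeps running inside Redshift
    # after the Lambda returns. Poll describe_statement if you need the outcome.
    resp = redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="loader",
        Sql=COPY_SQL,
    )
    return {"statement_id": resp["Id"]}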

How does S3 assign a timestamp upon upload?

We have a process uploading files to S3. In fact, it's indirect. We use Amazon Elastic MapReduce (EMR), and Hadoop commits the files to S3, from many different task nodes. Then, after that Hadoop job has completed successfully, another part of the process uses Hadoop's FileSystem.createNewFile() to create some files from the master node.
The files that are created from these various machines have timestamps in S3. We assume the timestamps of the files committed from the task nodes are earlier than those of the files created from the master node.
I believe that is sometimes untrue, but why?
What assigns the timestamp to an S3 file? Is it the Amazon EMR Hadoop client, or some S3 machine?
If I have two machines uploading to S3 whose local clock differs by 30 minutes, will the timestamps be 30 minutes apart?
You are unable to set the Last-Modified values yourself. S3 decides them:
https://forums.aws.amazon.com/thread.jspa?messageID=209241
The only timestamp in S3 appears to be the "Last Modified" meta-data. I believe that the last modified date/time is updated by the S3 system itself, and reflects the time when the file completed uploading fully to S3 (S3 will not show incomplete transfers.)
So it shouldn't matter which node you upload a file from, the "last modified" timestamp on S3 should be consistently the same when you list it on S3.
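If you want to check this yourself, a quick boto3 sketch (bucket and keys are invented; substitute your own objects) that reads the server-assigned Last-Modified values for two objects and compares them:
# Hypothetical check of the S3-assigned Last-Modified timestamps of two objects.
import boto3

s3 = boto3.client("s3")

task_out = s3.head_object(Bucket="my-bucket", Key="job/output/part-00000")
master_out = s3.head_object(Bucket="my-bucket", Key="job/output/_SUCCESS")

# LastModified is set by S3 when the upload completes, not by the uploader's clock.
print("task node file:  ", task_out["LastModified"])
print("master node file:", master_out["LastModified"])
print("master written later?", master_out["LastModified"] >= task_out["LastModified"])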