I have some data stored in S3. I need to copy this data periodically from S3 to a Redshift cluster. For a bulk copy, I can use the COPY command to load from S3 into Redshift.
Similarly, is there any trivial way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline, which has various templates for moving data from one AWS service to another. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule, and it has a cron-style editor for configuring one.
AWS Lambda Redshift Loader is a good solution: it runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Firehose is the simplest way to get this done. Simply create a Kinesis Firehose delivery stream, point it at a specific table in your Redshift cluster, write data to the stream, done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
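Once the delivery stream exists, writing to it is a couple of boto3 calls. A minimal sketch (the stream name is a placeholder; note that Firehose just concatenates raw bytes, so each record needs its own newline delimiter for the downstream COPY to split rows correctly):

```python
import json

def to_firehose_record(row):
    # Firehose concatenates raw bytes, so each JSON record gets an
    # explicit trailing newline for Redshift's COPY to split rows on.
    return {"Data": (json.dumps(row) + "\n").encode("utf-8")}

def send_rows(stream_name, rows):
    # Assumes a delivery stream already created and pointed at Redshift.
    import boto3  # imported here so the helper above has no AWS dependency
    firehose = boto3.client("firehose")
    for row in rows:
        firehose.put_record(DeliveryStreamName=stream_name,
                            Record=to_firehose_record(row))
```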
The Kinesis option works only if the Redshift cluster is publicly accessible.
You can use the COPY command with Lambda. Configure two Lambdas: one creates a manifest file for your incoming new data, and the other reads that manifest and loads the data into Redshift with the Redshift Data API.
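A rough sketch of the two pieces, assuming CSV files and treating the cluster/role identifiers as placeholders (the `redshift-data` `execute_statement` call is the actual Data API entry point):

```python
def build_manifest(bucket, keys):
    # First Lambda: COPY manifest format. "mandatory": True makes COPY
    # fail loudly if one of the listed files has gone missing.
    return {"entries": [{"url": f"s3://{bucket}/{k}", "mandatory": True}
                        for k in keys]}

def copy_from_manifest(cluster_id, database, db_user, table,
                       manifest_url, iam_role):
    # Second Lambda: issue the COPY through the Redshift Data API, so
    # the function needs no JDBC driver or persistent connection.
    import boto3
    sql = (f"COPY {table} FROM '{manifest_url}' "
           f"IAM_ROLE '{iam_role}' MANIFEST CSV;")
    return boto3.client("redshift-data").execute_statement(
        ClusterIdentifier=cluster_id, Database=database,
        DbUser=db_user, Sql=sql)
```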
Related
When we run a "COPY INTO from AWS S3 Location" command, do the data files physically get copied from S3 to EC2 VM storage (SSD/RAM)? Or does the data still reside on S3 and get converted to Snowflake format?
And if I run COPY INTO and then suspend the warehouse, would I lose data on resumption?
Please let me know if you need any other information.
The data is loaded into Snowflake tables from an external location like S3. The files remain on S3; if there is a requirement to remove these files after the copy operation, you can use the "PURGE=TRUE" parameter along with the "COPY INTO" command.
The files as such stay in the S3 location; only the values from them are copied into the Snowflake tables.
Warehouse operations that are already running are not affected when the warehouse is shut down; they are allowed to complete. So there is no data loss in that event.
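For example, a sketch of issuing that COPY with PURGE from Python (assumes the snowflake-connector-python package; the table and stage path below are placeholders):

```python
def copy_into_sql(table, location, purge=True):
    # PURGE = TRUE removes the staged files from S3 only after they
    # have been loaded successfully.
    sql = f"COPY INTO {table} FROM {location}"
    if purge:
        sql += " PURGE = TRUE"
    return sql

def run_copy(conn_params, table, location):
    # Assumes valid Snowflake credentials in conn_params.
    import snowflake.connector
    with snowflake.connector.connect(**conn_params) as conn:
        conn.cursor().execute(copy_into_sql(table, location))
```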
When we run a "COPY INTO from AWS S3 Location" command, Snowflake copies the data files from your S3 location into Snowflake's own S3 storage. That Snowflake-managed location is accessible only by querying the table into which you loaded the data.
When you suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse, but allows any compute resources that are executing statements to continue until the statements complete, at which time the resources are shut down and the status of the warehouse changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.
More details: https://docs.snowflake.com/en/user-guide/warehouses-tasks.html#suspending-a-warehouse
Details on the loading mechanism you are using are in docs: https://docs.snowflake.com/en/user-guide/data-load-s3.html#bulk-loading-from-amazon-s3
Is there any option to load data incrementally from Amazon DynamoDB to Amazon S3 in AWS Glue? The bookmark option is enabled, but it is not working: it loads the complete data every time. Is the bookmark option not applicable when loading from DynamoDB?
It looks like Glue doesn't support job bookmarking for a DynamoDB source; it only accepts S3 sources :/
To load DynamoDB data incrementally, you might use DynamoDB Streams to process only the new data.
Enable DynamoDB Streams on your table and use a Lambda function to save those stream records to S3. This gives you more control over your data.
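A sketch of such a Lambda (the bucket name is a placeholder; the tiny decoder below handles only common scalar types, where boto3's `TypeDeserializer` covers the full DynamoDB type set):

```python
import json
from datetime import datetime, timezone

def _plain(attr):
    # Minimal decoder for DynamoDB-JSON scalars; boto3's TypeDeserializer
    # is the full-featured equivalent.
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        return float(attr["N"]) if "." in attr["N"] else int(attr["N"])
    if "BOOL" in attr:
        return attr["BOOL"]
    if "NULL" in attr:
        return None
    return attr  # pass through types this sketch doesn't handle

def records_to_jsonl(event):
    # Keep only the new image of inserted/updated items, one JSON per line.
    lines = []
    for rec in event.get("Records", []):
        image = rec.get("dynamodb", {}).get("NewImage")
        if image:
            lines.append(json.dumps({k: _plain(v) for k, v in image.items()}))
    return "\n".join(lines)

def handler(event, context):
    body = records_to_jsonl(event)
    if body:
        import boto3
        key = f"dynamodb-export/{datetime.now(timezone.utc).isoformat()}.jsonl"
        boto3.client("s3").put_object(Bucket="my-export-bucket",  # placeholder
                                      Key=key, Body=body.encode("utf-8"))
```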
I am trying to source data from Athena or Redshift into SageMaker or AWS Forecast directly, without going through flat files. In SageMaker I use Python code in a Jupyter notebook. Is there any way to do this without even connecting to S3?
So far I have been using flat files, which is not what I want.
If you're only using a SageMaker notebook instance, your data doesn't have to be in S3. You can use the boto3 SDK or a SQL connection (depending on the backend) to download data, store it locally, and work on it in your notebook.
If you're using the SageMaker SDK to train, then yes, data must be in S3. You can either do this manually if you're experimenting, or use services like AWS Glue or AWS Batch to automate your data pipeline.
Indeed, Athena data is probably already in S3, although it may be in a format that your SageMaker training code doesn't support. Creating a new table with the right SerDe (say, CSV) may be enough. If not, you can certainly get the job done with AWS Glue or Amazon EMR.
When it comes to Redshift, dumping CSV data to S3 is as easy as:
unload ('select * from mytable')
to 's3://mybucket/mytable'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';
Hope this helps.
If you are using SageMaker, you have to read data from S3: SageMaker does not read data from Redshift directly, but it can read data from Athena using PyAthena.
If your data source is in Redshift, you need to unload your data to S3 first to be able to use it in SageMaker. If you are using Athena, your data is already in S3.
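A minimal PyAthena sketch (assumes `pip install pyathena pandas`; the staging directory is an S3 prefix you own where Athena writes its query results):

```python
def athena_to_dataframe(query, staging_dir, region_name):
    # PyAthena is a DB-API driver for Athena; pandas can read straight
    # from the connection, so results never touch a local flat file.
    import pandas as pd
    from pyathena import connect
    conn = connect(s3_staging_dir=staging_dir, region_name=region_name)
    return pd.read_sql(query, conn)
```

Usage would look like `athena_to_dataframe("SELECT * FROM mydb.mytable LIMIT 1000", "s3://my-athena-results/", "us-east-1")`, where the database, table, and bucket names are placeholders.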
Amazon Machine Learning used to support reading data from Redshift or RDS but unfortunately it's not available any more.
SageMaker Data Wrangler now allows you to read data directly from Amazon Redshift. But I'm not sure whether that works across AWS accounts (e.g. if you had one account for dev and another for prod).
I am designing an application that should read a txt file from S3 every 15 minutes, parse the data separated by |, and load it into an Aerospike cluster in 3 different AWS regions.
The file size can range from 0 to 32 GB, and the number of records it may contain is between 5 and 130 million.
I am planning to deploy a custom Java process in every AWS region that will download the file from S3 and load it into Aerospike using multiple threads.
I just came across AWS Glue. Can anybody tell me whether I can use AWS Glue to load this big chunk of data into Aerospike? Or is there any other recommendation for setting up an efficient and performant application?
Thanks in advance!
AWS Glue does an extract and transform, then loads into Redshift, EMR, or Athena. You should take a look at AWS Data Pipeline instead, using the ShellCommandActivity to run your S3 data through extraction and transformation and write the transformed data to Aerospike.
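Whichever service ends up running it, the parse-and-load step itself can be sketched in a few lines. This assumes a hypothetical three-field record layout and the `aerospike` Python client (imported inside the loader so the parsing helpers stay dependency-free, and the cluster host is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

FIELDS = ("id", "name", "value")  # hypothetical layout of the | file

def parse_line(line):
    # One pipe-delimited record -> dict of Aerospike bins.
    return dict(zip(FIELDS, line.rstrip("\n").split("|")))

def batches(iterable, size=10_000):
    # Group records so each thread writes a sizeable chunk at a time.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def load_batch(records, namespace="test", set_name="events"):
    # Assumes a reachable Aerospike cluster; hosts are placeholders.
    import aerospike
    client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
    try:
        for rec in records:
            client.put((namespace, set_name, rec["id"]), rec)
    finally:
        client.close()

def load_file(lines, workers=8):
    parsed = (parse_line(l) for l in lines)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_batch, batches(parsed)))
```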
I've created an AWS Glue table based on the contents of an S3 bucket, which allows me to query the data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all newly uploaded data in the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something like that?
A crawler is needed to register new data partitions in the Data Catalog. For example, suppose your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative solution is to run 'MSCK REPAIR TABLE {table-name}' in Athena.
Besides that, a crawler can detect a change in schema and take appropriate action, depending on your configuration.
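If you'd rather script the repair-table alternative than schedule the crawler, the statement can be issued through the Athena API. A sketch, where the database, table, and output location are all placeholders for your own names:

```python
def repair_sql(table):
    # MSCK REPAIR TABLE scans the table's S3 path and registers any
    # partition directories that are missing from the Data Catalog.
    return f"MSCK REPAIR TABLE {table};"

def repair_table(database, table, output_location):
    # output_location is the S3 prefix where Athena writes query results.
    import boto3
    return boto3.client("athena").start_query_execution(
        QueryString=repair_sql(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
```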