I have large flat files in AWS S3 and need to load the data into Teradata. I am trying to write a Python script for this and would appreciate help with the approach. I am very new to big data and the cloud.
Related
Is there any option to load data incrementally from Amazon DynamoDB to Amazon S3 in AWS Glue? The bookmark option is enabled, but it is not working: the job loads the complete data set every time. Is the bookmark option not applicable when loading from DynamoDB?
It looks like Glue doesn't support job bookmarks for a DynamoDB source; bookmarks work for S3 sources (and JDBC sources), but not for DynamoDB :/
To load DynamoDB data incrementally, you might use DynamoDB Streams to process only new data.
Enable DynamoDB Streams for your table and use a Lambda function to save the stream records to S3. This will give you more control over your data.
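The Streams-plus-Lambda approach above could be sketched roughly as follows. This is only a minimal illustration, assuming the stream's view type includes new images (NEW_IMAGE or NEW_AND_OLD_IMAGES); the bucket name, key prefix, and the extra `s3_client` parameter are hypothetical and not part of any answer here. Items are written as-is in DynamoDB's typed JSON form.

```python
import json


def new_images(event):
    """Collect the NewImage of every INSERT/MODIFY record in a DynamoDB Streams event."""
    images = []
    for record in event.get("Records", []):
        if record.get("eventName") in ("INSERT", "MODIFY"):
            image = record.get("dynamodb", {}).get("NewImage")
            if image is not None:
                images.append(image)
    return images


def lambda_handler(event, context, s3_client=None, bucket="my-export-bucket"):
    """Write the changed items of one stream batch to S3 as a JSON Lines object.

    `bucket` and the key prefix are placeholders (assumptions for this sketch).
    """
    images = new_images(event)
    if not images:
        return {"written": 0}
    if s3_client is None:
        import boto3  # imported lazily so the helper above can run without AWS access
        s3_client = boto3.client("s3")
    body = "\n".join(json.dumps(image) for image in images)
    # One object per invocation, named after the Lambda request ID.
    key = f"dynamodb-changes/{context.aws_request_id}.jsonl"
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return {"written": len(images), "key": key}
```

Deleted items (REMOVE events) are ignored in this sketch; whether that is acceptable depends on what the downstream consumer of the S3 data expects.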
I am trying to source data from Athena or Redshift into SageMaker or AWS Forecast directly, without using flat files. In SageMaker I use Python code in a Jupyter notebook. Is there any way to do so without even connecting to S3?
So far I have been using flat files, which is not what I want.
If you're only using a SageMaker notebook instance, your data doesn't have to be in S3. You can use the boto3 SDK or a SQL connection (depending on the backend) to download data, store it locally, and work on it in your notebook.
If you're using the SageMaker SDK to train, then yes, data must be in S3. You can either do this manually if you're experimenting, or use services like AWS Glue or AWS Batch to automate your data pipeline.
Indeed, Athena data is probably already in S3, although it may be in a format that your SageMaker training code doesn't support. Creating a new table with the right SerDe (say, CSV) may be enough. If not, you can certainly get the job done with AWS Glue or Amazon EMR.
When it comes to Redshift, dumping CSV data to S3 is as easy as:
unload ('select * from mytable')
to 's3://mybucket/mytable'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';
Hope this helps.
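Since the question mentions working from Python in a notebook, the same UNLOAD could also be submitted from Python through the Redshift Data API. The following is a rough sketch under assumptions: the cluster identifier, database, and database user are placeholders you would supply, and the helper names are made up for this example.

```python
def build_unload_sql(query, s3_prefix, iam_role_arn, delimiter=","):
    """Build an UNLOAD statement like the one shown above.

    Single quotes inside the query are doubled, because the query itself
    is passed to UNLOAD as a quoted string.
    """
    escaped = query.replace("'", "''")
    return (
        f"unload ('{escaped}') "
        f"to '{s3_prefix}' "
        f"iam_role '{iam_role_arn}' "
        f"delimiter '{delimiter}';"
    )


def submit_unload(sql, cluster_id, database, db_user, client=None):
    """Submit the statement via the Redshift Data API and return the statement ID.

    The call is asynchronous: the returned ID can be polled with
    describe_statement to find out when the unload has finished.
    """
    if client is None:
        import boto3  # imported lazily so build_unload_sql works without AWS access
        client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```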
If you are using SageMaker, you have to use S3 to read data. SageMaker does not read data from Redshift directly, but you can read data from Athena using PyAthena.
If your data source is in Redshift you need to load your data to S3 first to be able to use in SageMaker. If you are using Athena your data is already in S3.
Amazon Machine Learning used to support reading data from Redshift or RDS but unfortunately it's not available any more.
SageMaker Data Wrangler now allows you to read data directly from Amazon Redshift, but I'm not sure whether it works across AWS accounts (e.g. if you had one account for dev and another account for prod).
I am using AWS Glue ETL jobs to ingest some datasets using their PySpark API; namely loading a DynamicFrame from S3 objects, doing some transformations and finally writing the results in some S3 location (using glueContext.write_dynamic_frame.from_options) or a catalog (using glueContext.write_dynamic_frame.from_catalog).
To keep things organized, we have policies that prevent object creation in the target locations if they are not tagged properly.
I am wondering if there is a way to tag the S3 objects that are created as part of the write process.
This is not possible using the Glue APIs. You may have to use the boto3 S3 client to add tags.
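As a rough illustration of the boto3 route, the sketch below tags every object under the prefix that a Glue job wrote to. The function name and its parameters are hypothetical; the S3 calls are `list_objects_v2` (through a paginator) and `put_object_tagging`. Note that this tags objects after they have been created, so on its own it would not satisfy a policy that rejects untagged objects at creation time.

```python
def tag_objects_under_prefix(s3_client, bucket, prefix, tags):
    """Apply the same tag set to every object under `prefix`; return the tagged keys.

    put_object_tagging replaces any tags an object already has.
    """
    tag_set = [{"Key": key, "Value": value} for key, value in tags.items()]
    tagged = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page (or the whole prefix) is empty.
        for obj in page.get("Contents", []):
            s3_client.put_object_tagging(
                Bucket=bucket,
                Key=obj["Key"],
                Tagging={"TagSet": tag_set},
            )
            tagged.append(obj["Key"])
    return tagged
```

It could be called at the end of the job script, for example as `tag_objects_under_prefix(boto3.client("s3"), "my-bucket", "output/run-1/", {"team": "data"})`, where the bucket, prefix, and tag are placeholders.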
I am designing an application which should read a txt file from S3 every 15 minutes, parse the data (separated by |), and load it into Aerospike clusters in 3 different AWS regions.
The file size can range from 0 to 32 GB, and the number of records it contains may be between 5 and 130 million.
I am planning to deploy a custom Java process in every AWS region, which will download the file from S3 and load it into Aerospike using multiple threads.
I just came across AWS Glue. Can anybody tell me if I can use AWS Glue to load this big chunk of data into Aerospike? Or is there any other recommendation for setting up an efficient and performant application?
Thanks in advance!
AWS Glue extracts, transforms, and then loads into Redshift, EMR, or Athena. You should take a look at AWS Data Pipeline instead, using a ShellCommandActivity to run your S3 data through extraction and transformation and to write the transformed data to Aerospike.
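Whatever runs the job (a ShellCommandActivity or the custom process from the question), the script itself has to stream the file rather than hold up to 32 GB in memory. Here is a minimal sketch of that part in Python, assuming UTF-8 text with one |-separated record per line. The `write_batch` callback stands in for the actual Aerospike write, which is not shown, and all names are made up for this example.

```python
import csv
from itertools import islice


def iter_batches(lines, batch_size=10000, delimiter="|"):
    """Parse delimited text lines lazily and yield lists of at most `batch_size` rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def load_from_s3(bucket, key, write_batch, s3_client=None, batch_size=10000):
    """Stream one S3 object line by line and hand parsed batches to `write_batch`.

    `write_batch` is a hypothetical callback that would write rows to Aerospike.
    Returns the number of records processed.
    """
    if s3_client is None:
        import boto3  # imported lazily so iter_batches can be used without AWS access
        s3_client = boto3.client("s3")
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    # iter_lines() yields bytes without line endings, so the object is never fully in memory.
    lines = (raw.decode("utf-8") for raw in body.iter_lines())
    total = 0
    for batch in iter_batches(lines, batch_size):
        write_batch(batch)
        total += len(batch)
    return total
```

Batches could be handed to a thread pool to parallelize the writes, much like the multi-threaded design described in the question.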
I have some data stored in S3. I need to copy this data periodically from S3 to a Redshift cluster. To do a bulk copy, I can use the COPY command to copy from S3 to Redshift.
Similarly, is there any simple way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline which has various templates for moving data from one AWS service to other. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule, and it has a cron-style editor for scheduling.
AWS Lambda Redshift Loader is a good solution that runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Firehose is the simplest way to get this done. Simply create a Kinesis Firehose stream, point it at a specific table in your Redshift cluster, write data to the stream, done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
The Kinesis option works only if the Redshift cluster is publicly accessible.
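For the "write data to the stream" step, a minimal Python sketch might look like the following. It assumes the delivery stream already exists and that the Redshift COPY options configured on it expect |-delimited lines; the stream name and the helper itself are hypothetical. `put_record_batch` accepts at most 500 records per call, hence the chunking.

```python
def send_rows(rows, stream_name, client=None, delimiter="|", batch_size=500):
    """Send rows to a Firehose delivery stream as delimited lines.

    Returns the number of records Firehose reported as failed, which a
    real loader would retry.
    """
    if client is None:
        import boto3  # imported lazily so the function can be exercised with a stub client
        client = boto3.client("firehose")
    failed = 0
    for start in range(0, len(rows), batch_size):
        records = [
            {"Data": (delimiter.join(str(value) for value in row) + "\n").encode("utf-8")}
            for row in rows[start:start + batch_size]
        ]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records,
        )
        failed += response.get("FailedPutCount", 0)
    return failed
```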
You can use the COPY command with Lambda. You can configure two Lambda functions: one creates a manifest file for your incoming new data, and the other reads that manifest and loads the data into Redshift using the Redshift Data API.
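The two pieces of that design could be sketched as below: one helper builds the manifest JSON that COPY understands, and the other builds the COPY statement and submits it through the Redshift Data API. This is an illustration only; the table name, role ARN, cluster identifier, database, and user are placeholders, and how the two Lambda functions are triggered is left out.

```python
import json


def build_manifest(s3_urls):
    """Return the manifest document (as a JSON string) listing the files to load."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries})


def build_copy_sql(table, manifest_url, iam_role_arn):
    """Build a COPY statement that loads the files named in the manifest."""
    return (
        f"copy {table} "
        f"from '{manifest_url}' "
        f"iam_role '{iam_role_arn}' "
        f"manifest;"
    )


def run_copy(sql, cluster_id, database, db_user, client=None):
    """Submit the COPY through the Redshift Data API and return the statement ID."""
    if client is None:
        import boto3  # imported lazily so the builders above work without AWS access
        client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```

The first function would write the output of `build_manifest` to S3 with `put_object`; the second would call `build_copy_sql` with that object's `s3://` URL and pass the result to `run_copy`.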