Loading incremental data from DynamoDB to S3 in AWS Glue

Is there any option to load data incrementally from Amazon DynamoDB to Amazon S3 in AWS Glue? The bookmark option is enabled, but it is not working: the job loads the complete dataset every time. Is the bookmark option not applicable when loading from DynamoDB?

It looks like Glue doesn't support job bookmarks for a DynamoDB source; bookmarks only work with S3 sources.
To load DynamoDB data incrementally, you might use DynamoDB Streams so that only new data is processed.

Enable DynamoDB Streams for your table and use a Lambda function to save those stream records to S3. This will give you more control over your data.
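A minimal sketch of that approach, assuming a Lambda function subscribed to the table's stream; the bucket name and key prefix here are placeholders:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-archive-bucket"  # placeholder bucket name

def handler(event, context):
    """Triggered by a DynamoDB Stream; writes each batch of stream records to S3."""
    records = event.get("Records", [])
    if not records:
        return {"written": 0}

    # One object per invocation, keyed by timestamp so nothing is overwritten.
    key = "dynamodb-incremental/{}.json".format(
        datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S-%f")
    )
    body = "\n".join(json.dumps(r["dynamodb"]) for r in records)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"written": len(records)}
```

With this in place, only the inserts, updates, and deletes that actually hit the table end up in S3, which is the incremental behaviour the bookmark option would otherwise have provided.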

Related

How to store streaming data from Amazon Kinesis Data Firehose to an S3 bucket

I want to improve my current application. I am using Redis via ElastiCache in AWS to store some user data from my website.
This solution is not scalable, and I want to scale it using Amazon Kinesis Data Firehose for auto-scaled streaming output, AWS Lambda to modify the input data, an S3 bucket for storage, and AWS Athena for access.
I have been googling for several days, but I really don't understand how Amazon Kinesis Data Firehose stores the data in S3.
Will Firehose store the data as a single file per record it processes, or is there a way to append the data to the same CSV, or to group the data into different CSVs?
Amazon Kinesis Data Firehose will group data into a file based on:
- Size of data (e.g. 5 MB)
- Duration (e.g. every 5 minutes)
Whichever one hits the limit first will trigger the data storage in Amazon S3.
Therefore, if you need near-realtime reporting, go for a short duration. Otherwise, go for larger files.
Once a file is written in Amazon S3, it is immutable and Kinesis will not modify its contents. (No appending or modification of objects.)
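For reference, these buffering thresholds are set when the delivery stream is created. A sketch with Boto3, where the stream name, bucket ARN, and role ARN are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# Placeholder names and ARNs -- substitute your own.
firehose.create_delivery_stream(
    DeliveryStreamName="my-firehose-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-target-bucket",
        "Prefix": "firehose/",
        "BufferingHints": {
            "SizeInMBs": 5,           # flush once ~5 MB has accumulated...
            "IntervalInSeconds": 300, # ...or after 5 minutes, whichever comes first
        },
    },
)
```

Tuning `SizeInMBs` down and `IntervalInSeconds` down gives you fresher but smaller files; tuning them up gives you fewer, larger objects in S3.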

Tagging S3 objects written with glueContext.write_dynamic_frame

I am using AWS Glue ETL jobs to ingest some datasets using their PySpark API; namely loading a DynamicFrame from S3 objects, doing some transformations and finally writing the results in some S3 location (using glueContext.write_dynamic_frame.from_options) or a catalog (using glueContext.write_dynamic_frame.from_catalog).
To keep things organized, we have policies that prevent object creation in the target locations if they are not tagged properly.
I am wondering if there is a way to tag the S3 objects that are created as part of the write process.
It is not possible using the Glue APIs. You may have to use the S3 Boto3 library to add tags after the objects are written.
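A sketch of that workaround with Boto3, tagging the objects after the Glue write completes; the bucket, prefix, and tag values are placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-glue-output-bucket"    # placeholder: the job's target bucket
PREFIX = "output/run-2021-01-01/"   # placeholder: the job's target path

# Tag every object the Glue job wrote under the target prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        s3.put_object_tagging(
            Bucket=BUCKET,
            Key=obj["Key"],
            Tagging={"TagSet": [{"Key": "team", "Value": "data-eng"}]},
        )
```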

Can I use AWS Glue to load data into Aerospike?

I am designing an application which should read a TXT file from S3 every 15 minutes, parse the data (separated by |), and load it into an Aerospike cluster in 3 different AWS regions.
The file size can range from 0 to 32 GB, and the number of records it may contain is between 5 and 130 million.
I am planning to deploy a custom Java process in every AWS region which will download the file from S3 and load it into Aerospike using multiple threads.
I just came across AWS Glue. Can anybody tell me if I can use AWS Glue to load this big chunk of data into Aerospike? Or is there any other recommendation to set up an efficient and performant application?
Thanks in advance!
AWS Glue extracts and transforms data and then loads it into Redshift, EMR, or Athena. You should take a look at AWS Data Pipeline instead, using a ShellCommandActivity to run your S3 data through extraction and transformation and to write the transformed data to Aerospike.
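As an illustration of the load step such a ShellCommandActivity (or any custom process) could run, here is a sketch using the Aerospike Python client rather than Java; the host address, namespace, set, and the assumption that the first field is the record key are all placeholders:

```python
import boto3
import aerospike  # pip install aerospike

# Assumed cluster endpoint and namespace/set names.
client = aerospike.client({"hosts": [("10.0.0.10", 3000)]}).connect()
s3 = boto3.client("s3")

def load_file(bucket, key):
    """Stream a pipe-delimited S3 object into Aerospike, one record per line."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    for line in body.iter_lines():
        fields = line.decode("utf-8").split("|")
        # Assumes the first field is the primary key; adjust to your layout.
        record_key = ("my-namespace", "my-set", fields[0])
        bins = {"f{}".format(i): value for i, value in enumerate(fields[1:])}
        client.put(record_key, bins)

load_file("my-input-bucket", "exports/latest.txt")
```

For files at the 32 GB end of the range you would want to parallelise this (multiple workers over byte ranges or split files), which is where the multi-threaded custom process the question describes comes in.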

What's the use of periodically scheduling an AWS Glue crawler? Running it once seems to be enough

I've created an AWS Glue table based on the contents of an S3 bucket. This allows me to query data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all newly uploaded data in the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something?
The crawler is needed to register new data partitions in the Data Catalog. For example, say your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative solution is to run 'MSCK REPAIR TABLE {table-name}' in Athena, as sketched below.
Besides that, the crawler can detect changes in the schema and take appropriate actions depending on your configuration.
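If you would rather script the partition registration than schedule the crawler, here is a sketch that issues MSCK REPAIR TABLE through the Athena API; the database, table, and query-results location are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and results location.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Running this (or the crawler) after each batch of new date folders arrives keeps the new partitions visible to Athena.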

Stream data from an S3 bucket to Redshift periodically

I have some data stored in S3. I need to clone/copy this data periodically from S3 to a Redshift cluster. To do a bulk copy, I can use the COPY command to copy from S3 to Redshift.
Similarly, is there any trivial way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline which has various templates for moving data from one AWS service to other. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule, and it has a cron-style editor for scheduling.
AWS Lambda Redshift Loader is a good solution that runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Data Firehose is the simplest way to get this done. Simply create a Kinesis Firehose delivery stream, point it at a specific table in your Redshift cluster, write data to the stream, and you're done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
The Kinesis option works only if Redshift is publicly accessible.
You can use the COPY command with Lambda. You can configure two Lambdas: one creates a manifest file for your incoming new data, and the other reads that manifest and loads it into Redshift with the Redshift Data API.
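A sketch of what the loading Lambda might look like under that setup, running COPY against a manifest through the Redshift Data API; the cluster, database, user, table, manifest path, and IAM role are all placeholders:

```python
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    """Runs COPY from a pre-built manifest file; all identifiers are placeholders."""
    sql = (
        "COPY my_schema.my_table "
        "FROM 's3://my-bucket/manifests/latest.manifest' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "MANIFEST FORMAT AS CSV;"
    )
    return redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="loader",
        Sql=sql,
    )
```

The first Lambda would write the manifest listing the newly arrived S3 objects; scheduling or triggering this second Lambda then gives you the periodic S3-to-Redshift copy without a publicly accessible cluster.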