I want to build a data lake in AWS S3 and am asking myself how to handle CDC (change data capture). I want to avoid reloading all of the data from the sources each time, and I also want to avoid duplicates in the target. Are there proven methodologies for tackling that?
You can refer to the following blog post:
https://aws.amazon.com/blogs/big-data/loading-ongoing-data-lake-changes-with-aws-dms-and-aws-glue/
The deduplication is done by AWS Glue, which runs jobs on the raw data and writes the result to another bucket that acts as a mirrored replica of your source database.
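As a rough illustration of that deduplication step, a Glue job written in PySpark can keep just the latest version of each row that AWS DMS delivers. The database, table, and column names below (including the DMS Op column and a change-sequence column) are placeholders for the sketch, not something taken from the blog post:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Raw CDC files written by DMS, registered in the Glue Data Catalog
    # (database/table names are placeholders).
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="cdc_db", table_name="orders_cdc"
    ).toDF()

    # Keep only the most recent change per primary key, then drop deletes.
    latest = Window.partitionBy("order_id").orderBy(F.col("transact_seq").desc())
    deduped = (
        raw.withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .filter(F.col("op") != "D")
        .drop("rn")
    )

    # Write the mirrored copy to the curated bucket as Parquet.
    deduped.write.mode("overwrite").parquet("s3://my-curated-bucket/orders/")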
I have some data stored in an S3 bucket and I want to load it into one of my Snowflake databases. Could you help me better understand the two following points, please?
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage so you don't have to define it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage, making that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for dev-to-prod migrations as well.
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.
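For example, a stage set up once along these lines can then be reused by every load. The integration, file format, table, and bucket names below are placeholders rather than anything specific to your account:

    # Rough sketch using the snowflake-connector-python package.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="load_wh", database="my_db", schema="public",
    )
    cur = conn.cursor()

    # File format + external stage defined once, reused by every COPY INTO.
    cur.execute("""
        CREATE FILE FORMAT IF NOT EXISTS my_csv_format
          TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1
    """)
    cur.execute("""
        CREATE STAGE IF NOT EXISTS my_s3_stage
          URL = 's3://my-bucket/exports/'
          STORAGE_INTEGRATION = my_s3_integration
          FILE_FORMAT = my_csv_format
    """)

    # Loading then only needs the stage name, not bucket/credentials/format.
    cur.execute("COPY INTO my_table FROM @my_s3_stage PATTERN = '.*[.]csv'")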
I've never used AWS Glue, but I believe it will deliver what I want and am after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated with it. I want Glue to perform a Create Table As (with all the necessary converts/casts) against this dataset, output it in Parquet format, and then move that dataset from one S3 bucket to another so the primary Athena table can access the data.
As stated, I've never used Glue before and want a starter for ten so I don't go down rabbit holes.
I currently perform all these steps manually, so I want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, it doesn't run CREATE TABLE AS SELECT queries; instead it does the work with ETL jobs based on Spark. There is a GitHub repo that describes such a process in quite a detailed way, and there is also official AWS documentation on ETL programming with the AWS Glue service. After the initial setup, you can define trigger events/scheduling to run your Glue ETL jobs automatically.
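A very rough sketch of what such a job can look like is below; the catalog database, table, column names, and bucket path are placeholders, not anything from the repo or docs:

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Staging Athena table where every column is a string (placeholder names).
    staged = glue_context.create_dynamic_frame.from_catalog(
        database="staging_db", table_name="monthly_upload_csv"
    )

    # Cast the string columns to their real types.
    typed = ApplyMapping.apply(
        frame=staged,
        mappings=[
            ("sale_date", "string", "sale_date", "date"),
            ("amount", "string", "amount", "double"),
            ("customer_id", "string", "customer_id", "long"),
        ],
    )

    # Write Parquet to the bucket the primary Athena table points at.
    glue_context.write_dynamic_frame.from_options(
        frame=typed,
        connection_type="s3",
        connection_options={"path": "s3://primary-table-bucket/sales/"},
        format="parquet",
    )
    job.commit()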
However, one thing to remember is the cost of using AWS Glue. Since pricing is based on execution time, it is sometimes not trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In that case you would know the price exactly, since it depends on the size of your data. You can then use the AWS API to manipulate the metadata catalog so that the new information is accessible and in one place.
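For reference, a CTAS of that kind kicked off from Python might look roughly like this; the database, table, and bucket names are made up for the example:

    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    ctas = """
        CREATE TABLE curated.monthly_sales
        WITH (
            format = 'PARQUET',
            external_location = 's3://curated-bucket/monthly_sales/'
        ) AS
        SELECT CAST(sale_date AS date) AS sale_date,
               CAST(amount AS double)  AS amount
        FROM staging.monthly_upload_csv
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "staging"},
        ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
    )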
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can come up with quite complicated queries) and looking into the open source project Apache Airflow for automation/scheduling and orchestration. This is the approach I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retrying logic. It even has hooks to interact with AWS services, and it provides a dedicated operator for sending queries to Athena. I wrote a little more about this approach here.
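A minimal DAG using that operator could look something like the following; the import path assumes a recent Amazon provider package (older releases expose AWSAthenaOperator instead), and the query, database, and bucket names are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.athena import AthenaOperator

    with DAG(
        dag_id="monthly_csv_to_parquet",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        # One task that sends the CTAS query to Athena and waits for it.
        run_ctas = AthenaOperator(
            task_id="ctas_to_parquet",
            query="CREATE TABLE curated.monthly_sales WITH (format = 'PARQUET', "
                  "external_location = 's3://curated-bucket/monthly_sales/') AS "
                  "SELECT * FROM staging.monthly_upload_csv",
            database="staging",
            output_location="s3://athena-query-results-bucket/",
        )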
One of the things I see becoming more of a problem in microservice architectures is disaster recovery. For instance, a common pattern is to store large data objects such as multimedia in S3, whilst JSON data goes in DynamoDB. But what happens when a hacker comes along and manages to delete a whole bunch of data from your DynamoDB?
You also need to make sure your S3 bucket is restored to the same state it was in at that time, but are there elegant ways of doing this? The concern is that it will be difficult to guarantee that the S3 backup and the DynamoDB backup are in sync.
I am not aware of a solution that does a genuinely synchronised backup-restore across services. However, you could use the native DynamoDB point-in-time restore and the third-party s3-pit-restore tool to restore both services to a common point in time.
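A rough sketch of what that could look like, with placeholder table/bucket names and a timestamp you would pick yourself (check the s3-pit-restore README for its exact CLI flags):

    from datetime import datetime, timezone
    import subprocess
    import boto3

    restore_point = datetime(2023, 6, 1, 3, 0, 0, tzinfo=timezone.utc)

    # DynamoDB point-in-time restore into a new table (PITR must be enabled).
    dynamodb = boto3.client("dynamodb")
    dynamodb.restore_table_to_point_in_time(
        SourceTableName="orders",
        TargetTableName="orders-restored",
        RestoreDateTime=restore_point,
    )

    # S3 restore to the same timestamp via the third-party s3-pit-restore CLI
    # (requires versioning on the bucket; flag names follow its README).
    subprocess.run(
        ["s3-pit-restore", "-b", "media-bucket", "-d", "/tmp/media-restore",
         "-t", restore_point.strftime("%m-%d-%Y %H:%M:%S %z")],
        check=True,
    )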
I'm trying to get a handle on what I would use to schedule and run jobs that move data into S3, run scripts on it, and move it around S3 afterwards.
My requirement is to be able to ingest from APIs and also directly from databases. Some formats to ingest will be XML, and others could be flat files. The raw files need to be joined, transformed, and turned into a format that graphs can be produced from.
What is AWS Glue like as an ETL tool? My specific question is: can you see the finished pipelines, showing the data sources and processing steps, in a graphical view once they are created?
I have used Azure Data Factory, and it had a graphical UI to view and monitor the pipelines, which I found quite useful. I'm just wondering whether AWS Glue has a similar thing.
If not, would NiFi on AWS with S3 be a good way to do this?
Thanks
If you are looking for the best GUI, I would recommend NiFi. It is commonly used with S3 and has many connectors out of the box for other data sources. It becomes even more interesting if you want to do things outside of the AWS cloud.
That being said, I would think that Glue will also get the job done.
Running Data Factory when you have a heavy AWS footprint feels like an anti-pattern.
Full disclosure: I have not worked with Glue/Data Factory, and I work for Cloudera, the driving force behind NiFi.
I'm currently using AWS Glue to extract data from a DB into S3, manipulate the data, and save it back to Redshift/S3 or send it via API to my client. The AWS Glue GUI is not that good: you won't see a diagram of your flow, and sometimes you will need other tools like Step Functions or Airflow to orchestrate your jobs. Also, for most of my jobs I have to use plain PySpark because the AWS Glue methods are too limited.
Regarding monitoring, you can see whether there is an error, how much CPU and memory is being consumed by your job, and the S3 bytes read/written. If you want additional information, you need to use a logger or print to send it to the logs.
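As a small sketch of that, a couple of extra log lines pushed through the built-in Glue logger end up in the job's CloudWatch log stream:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # get_logger() returns a logger whose output lands in the job's
    # CloudWatch log stream alongside the default driver/executor logs.
    logger = glue_context.get_logger()
    logger.info("starting transform step")
    logger.warn("row count lower than expected")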
I would like to know if we can run ETL jobs on files in an EFS mount.
If so, how? Is it done with Hive or some other service?
Our target is to reduce all the files in one mount point to one file, and store that one file in S3 for better processing.
EFS in itself does not come with a particular data warehouse or ETL product. For data warehousing and ETL you can choose whichever tool you want that operates in the AWS environment.
On to your problem:
You want to concatenate or in some way combine all of the files currently in your EFS mount into a single file and store that in S3, if I understand it correctly.
You do not mention what type of data you have or what type of files you want to combine, and that makes a huge difference in how you would do this, so I will have to give general suggestions. If you have different types of data (SQL tables from different databases, documents, non-SQL data), then you need to determine how to combine it. For that you would be looking at a data integration solution that can accommodate raw data.
AWS has a few different products that may assist the process, such as Redshift, Athena, and its ETL solution Glue (Snowflake is a popular third-party option as well). Which products you add depends on your company's needs and budget.
A more flexible data integration approach would be to use ELT (extract, load, transform) instead of ETL. Basically, you would create an appropriate target object in your S3 bucket, extract each file on EFS one at a time, and load them into that S3 object. Then, when you query the data in S3, you would perform any transformations needed before seeing the query results. Here's an article that explains the differences in more detail: https://blog.panoply.io/etl-vs-elt-the-difference-is-in-the-how.
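As an illustration of the load step, a small script along these lines could concatenate the files from the mount and push one object to S3. The paths, bucket, and key are placeholders, and it assumes the files are plain text/CSV that can simply be appended together:

    import os
    import boto3

    MOUNT_POINT = "/mnt/efs/exports"
    COMBINED = "/tmp/combined.csv"

    # Concatenate every file under the mount point into one local file.
    with open(COMBINED, "wb") as out:
        for root, _dirs, files in os.walk(MOUNT_POINT):
            for name in sorted(files):
                with open(os.path.join(root, name), "rb") as part:
                    out.write(part.read())

    # Upload the combined file as a single S3 object.
    boto3.client("s3").upload_file(COMBINED, "my-target-bucket", "combined/combined.csv")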
There are vendors supporting the ELT process, such as Talend, Hadoop/Hive/Spark, Teradata, and Informatica, should you want to investigate options.