AWS Glue sync data from RDS (need to sync 4 table from all schema) to S3 (apache parque format) - amazon-s3

We are using a Postgres RDS instance (db.t3.2xlarge with around 2TB data). We have a multi-tenancy application so for all organizations who sign up in our product, we are creating a separate schema which replicates our data model. Now a couple of our schemas (around 5 to 10 schemas) contain a couple of big tables (around 5 to 7 big tables where each contains 10 to 200 million rows). For UI we need to show some statics as well as graphs and to calculate that statics as well as graph data we need to perform joins on big tables and it slows down our whole database server. Sometimes we need to do this type of query in night time so that users don't face any performance issues. So ss a solution we are planning to create a data lake in S3 so that all analytical load we can shift out of RDBMS and to an OLAP solution.
As a first step we need to transfer our data from RDS to S3 and also keep syncing both data sources. Can you please suggest which tool is a better choice for us considering the below requirements:
We need to update the last 3 days data on an hourly basis. We want to keep updating recent data because over the 3 day time window, it may change. After 3 days we can consider the data “at rest” and it can rest in the data lake without any future modification.
We are using a multi tenancy system currently and we are having ~350 schemas, But it will be increasing as more organizations sign up in our product.
We are planning to do ETL so in transform we are planning to join all tables and create one denormalized table and store the data in apache parque format in S3. So that we can perform analytical queries on that table using Redshift Spectrum, EMR, or some other tool.

I just found out about AWS Data Lake recently, and also based on my research (which will hopefully, assist you in the best solution possible)..
AWS Athena can partition data, and you may want to partition your data based on tenant id (customer id).
AWS Glue has crawlers:
Crawlers can run periodically to detect the availability of new data
as well as changes to existing data, including table definition
changes.

Related

Data Migration to Amazon S3 buckets

Experts: I have a situation where I need to transfer incremental data( every 5 minutes ) & daily data from an application database that has about 500+ tables to S3 for a lake house implementation. The data volumes for 5 minute interval is less than 0.5 million records. In the current world, there is SQL Server CDC that copies the data to another SQL ODS and gets into 2 different Data marts that's being used for Operational reporting.
Need your expertise to answer below questions
If we choose AWS Glue to transfer data to S3, do I need to write 500+ glue jobs one for each table? Is this right way of doing ? Are there any other tools or technologies that can transfer data easily.
If we had to do both incremental ( every 5 minute ) and also batch ( hourly/daily ), can the same jobs be used? if yes, where and how to configure the time period for extraction?
If more tables or columns get added in the source database , do I need to keep writing additional jobs or can I write a template job and call with parameters?
4.Are there any other tools ( apart from Glue ) and AWS cloud watch to monitor delays, failures & long running jobs
You can use AWS DMS to migrate data to s3 target. DMS also supports CDC. Whieh means it can also sync changes post initial migration.
To transfer data for example from on-prem to cloud, you need to have a replication instance. This can be any tier based on the size of data transfer.
Then a replication task has to be created. This can be execute immediately, or scheduled run at periodic intervals.
This use case can be solved by using AWS Database Migration Service. AWS Database Migration Service (AWS DMS) is a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.
Look at the doc for more information.
AWS Database Migration Service User Guide

Load daily MySQL DB snapshots from S3 to snowflake

I have daily MySQL DB snapshots stored on S3. This daily DB snapshot is a backup of 1000 tables in our DB, using mysqldump, size is about 300M daily (stored 1 year of snapshots, which is about 110G).
Now we want to load these snapshots daily to snowflake for reporting purpose. How do we create tables in snowflake? Shall we create 1000 tables? Will snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce you metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics or which will be accessed by BI tools (including direct SQL) I would hold in Snowflake - otherwise you wont get the performance that Snowflake can deliver and there is little point using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
Hope this helps?

Tableau visualization - Performance issue with huge data

I have huge data from different DB sources ( Oracle, Mongo, Cassandra ) and also eventing data available in Kafka. Using Tableau for analytics and facing performance issue with huge data. So, planning to store data in some other way and use Tableau for visualization also. Have multiple options now and need some help to finalize the approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
Not sure if any of the above approach will be a good solution for streaming data( Kafka ) analytics as well.
Note:- I have multiple big tables and need joins b/w them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore which stores table definition. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB data). The main difference is the pricing model (Athena is based on price per data processed per request and is serverless, whereas Spark may imply infrastructure cost).
Then, both options may not perform well as they are not OLAP systems designed for self service BI (they are better use for ad hoc queries over huge data regarding).
Then, you may have trouble in managing your data model using flat files and table or views over them (data storage and compression won't be optimized for each table which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift which is designed to support OLAP system. You can connect Tableau directly to Redshift but you'll suffer from latency and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it which will avoid performing the same query on Redshift each time a dashboard is opened in Tableau (in that case Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries using Redshift by setting the right partition and sort keys.
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

Building OLTP DB from Datalake?

I'm confused and having trouble finding examples and reference architecture where someone wants to extract data from an existing data lake (S3/Lakeformation in my case) and build a OLTP datastore that serves as an applications backend. Everything I come across is an OLAP data warehousing pattern (i.e. ETL -> S3 -> Redshift -> BI Tools) where data is always coming IN to the datalake and warehouse rather than being pulled OUT. I don't necessarily have a need for 'business analytics' but I do have a need for displaying graphs in web dashboards with large amounts of time series data points underneath for my websites users.
What if I want to automate pulling extracts of a large dataset in the datalake and build a relational database with some useful data extracts from the various datasets that need to be queried by the hand full instead of performing large analytical queries against a DW?
What if I just want an extract of say, stock prices over 10 years, and just get the list of unique ticker symbols for populating a drop down on a web app? I don't want to query an OLAP data warehouse every time to get this, so I want to have my own OLTP store for more performant queries on smaller datasets that will have much higher TPS?
What if I want to build dashboards for my web app's customers that display graphs of large amounts of time series data currently sitting in the datalake/warehouse. Does my web app connect directly to the DW to display this data? Or do I pull that data out of the datalake or warehouse and into my application DB on some schedule?
My views on your 3 questions:
Why not just use the same ETL solution that is being used to load the datalake?
Presumably your DW has a Ticker dimension that has unique records for each Ticker symbol? What's the issue with querying this as it would be very fast to get the unique Ticker symbols from it?
It depends entirely on your environment/infrastructure and what you are doing with the data - so there is no generic answer anyone could provide you with. If your webapp is showing aggregations of a large volume of data then your DW is probably better at doing those aggregations and passing the aggregated data to your webapp; if the webapp is showing unaggregated data (and only a small subset of what is held in your DW, such as just the last week's data) then loading it into your application DB might make more sense
The pros/cons of any solution would also be heavily influenced by your infrastructure e.g. what's the network performance like between your DW and application?

How to transform data from S3 bucket before writing to Redshift DW?

I'm creating a (modern) data warehouse in redshift. All of our infrastructure is hosted at Amazon. So far, I have setup DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly to S3.
Now I must transform and enrich this data from the S3 before I can write it to Redshift. Our DW have some tables for facts and dimensions (star schema), so, imagine a Customer dimension, it should contain not only the customer basic info, but address info, city, state, etc. This data is spread amongst a few tables in our business database.
So here's my problem, I don't have a clear idea of how to query the S3 staging area in order to join these tables and write it to my redshift DW. I want to do it using AWS services like Glue, Kinesis, etc. i.e. full serverless.
Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift since all of our data is highly relational in nature anyway? If so, the question remains, how to transform/enrich data before saving it on our DW schemas? I have searched everywhere for this particular topic but information on it is scarse.
Any help is appreciated.
There are a lot of ways to do this but one idea is to query the data using Redshift Spectrum. Spectrum is a way to query S3 (called an external database) using your Redshift cluster.
Really high-level, one way to do this would be to create a Glue Crawler job to crawl your S3 bucket, which creates the External Database that Redshift Spectrum can query.
This way, you don't need to move your data into Redshift itself. Likely, you will want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.
Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift but you can do that in a few ways:
1) Get an ETL tool or set up CRON jobs on a server or Glue that schedules SQL scripts to be ran. I do this with a Python script that connects to the database then runs the SQL text. This would be a little bit more of a bulk operation. You can also do this in a Lambda function and have it be scheduled on a Cloudwatch trigger which can be on a cron schedule
2) Use a Lambda function that runs the SQL script that you want that triggers for S3 PUTs into that bucket. That way the script will run right when the file drops. This would be basically a realtime operation. DMS drops files very quickly so you will have files dropping multiple times per minute so that might be more difficult to handle.
One option is to load the 'raw' data into Redshift as 'staging' tables. Then, run SQL commands to manipulate the data (JOINs, etc) into the desired format.
Finally, copy the resulting data into the 'public' tables that users query.
This is a normal Extract-Load-Transform process (slightly different to ETL) that uses the capabilities of Redshift to do the transform.