We have been using Synapse for some time and are primarily using the Serverless Pool with Parquet, External Tables, and Views. I have a few large views that read multiple external tables and take a long time to generate. I am hoping I might be able to schedule a job to generate these as parquet files that could be read by another external table.
I am wondering if there is a way I might be able to generate a parquet file via a SQL query on the Serverless Pool.
I believe I see a way to do it using a Spark Pool, but I was curious if there might be a way to use the Serverless Pool since I think that would be a cheaper option in the long run.
I am looking for the recommended way of streaming database changes from Cloud SQL (Postgres) to BigQuery. I see that CDC streaming does not seem to be available for Postgres. Does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically move and transform data from Postgres into BigQuery. There are many ways to do that, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method uses the COPY command to move data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures, which gives a lot of flexibility to extract data as a full dump or incrementally. Be aware that this is a time-consuming process that requires you to invest engineering bandwidth.
You could also try different tech stacks to implement the above; I recommend this one: Java Spring Data Flow
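To make the incremental extract idea concrete, here is a minimal Python sketch that only builds the COPY statement. The table name `orders` and the timestamp column `updated_at` are hypothetical, and actually running the statement (for example through psql or a Postgres driver) is left out.

```python
from datetime import datetime


def build_incremental_copy(table: str, ts_column: str, since: datetime) -> str:
    """Build a PostgreSQL COPY statement that exports rows changed after `since` as CSV.

    Identifiers are interpolated directly, so only pass trusted, hard-coded names.
    """
    cutoff = since.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"COPY (SELECT * FROM {table} "
        f"WHERE {ts_column} > '{cutoff}') "
        "TO STDOUT WITH (FORMAT csv, HEADER)"
    )


# Hypothetical table and column names, for illustration only.
statement = build_incremental_copy("orders", "updated_at", datetime(2024, 1, 1))
print(statement)
```

A full dump is the same statement without the WHERE clause. The resulting CSV still has to be loaded into BigQuery by a separate step.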
Way 2:
Using DataFlow
You can automate the ETL process using GCP's DataFlow without coding your own solution. It is faster, but of course it comes at a cost.
DataFlow is unified stream and batch data processing that's serverless, fast, and cost-effective.
Check more details and learn in a minute here
Also check this
I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also eventing data available in Kafka. I am using Tableau for analytics and facing performance issues with the huge data, so I am planning to store the data some other way and keep using Tableau for visualization. I have multiple options now and need some help finalizing the approach.
Option 1:-
Read the DB data, store it in Parquet files, expose it over Spark SQL, HiveQL, or Presto SQL, and let Tableau connect to that SQL layer.
Option 2:-
Read the DB data, store it in Parquet files in S3, use AWS Athena for analytics, and let Tableau connect to Athena.
Option 3:-
Read the DB data, store it in Parquet files in S3, move it to Redshift for analytics, and let Tableau connect to Redshift.
I am not sure whether any of the above approaches will also be a good solution for streaming data (Kafka) analytics.
Note: I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore which stores table definition. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB data). The main difference is the pricing model (Athena is based on price per data processed per request and is serverless, whereas Spark may imply infrastructure cost).
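To get a feel for the pay-per-scan pricing model, here is a rough Python sketch of the arithmetic. The price per TB is a parameter you would look up on the Athena pricing page; the value 5.0 used below and the decimal TB size are assumptions for illustration, not quoted prices.

```python
def scan_cost(bytes_scanned: int, price_per_tb: float, bytes_per_tb: int = 10**12) -> float:
    """Cost of one query under a price-per-data-scanned model."""
    return bytes_scanned / bytes_per_tb * price_per_tb


def monthly_cost(queries_per_day: int, avg_bytes_scanned: int, price_per_tb: float, days: int = 30) -> float:
    """Approximate monthly cost for a steady dashboard workload."""
    return queries_per_day * days * scan_cost(avg_bytes_scanned, price_per_tb)


# Assumed numbers: 200 dashboard queries per day, each scanning 50 GB.
print(monthly_cost(200, 50 * 10**9, price_per_tb=5.0))
```

Comparing that figure with the fixed cost of a Spark cluster gives a first idea of which side of the trade-off you are on. Columnar formats such as Parquet reduce the bytes scanned per query.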
However, neither option may perform well, as they are not OLAP systems designed for self-service BI (they are better suited to ad hoc queries over huge data).
You may also have trouble managing your data model using flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
Option 3
Option 3 is better, as it is based on Redshift, which is designed to support OLAP workloads. You can connect Tableau directly to Redshift, but you may suffer from latency, and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it which will avoid performing the same query on Redshift each time a dashboard is opened in Tableau (in that case Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys.
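In Redshift DDL, these storage choices are expressed with the DISTKEY and SORTKEY table attributes. The Python sketch below only generates such a statement; the table `fact_sales` and its columns are hypothetical, and which keys are right depends on your join and filter patterns.

```python
from typing import List, Tuple


def build_table_ddl(table: str, columns: List[Tuple[str, str]], dist_key: str, sort_keys: List[str]) -> str:
    """Generate a Redshift CREATE TABLE statement with a distribution key and sort keys."""
    names = [name for name, _ in columns]
    # Both kinds of key must refer to columns that exist in the table.
    for key in [dist_key, *sort_keys]:
        if key not in names:
            raise ValueError(f"unknown column: {key}")
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical fact table: distribute on the join column, sort on the filter column.
print(build_table_ddl(
    "fact_sales",
    [("sale_id", "BIGINT"), ("customer_id", "BIGINT"), ("sold_at", "TIMESTAMP")],
    dist_key="customer_id",
    sort_keys=["sold_at"],
))
```

Distributing two large tables on the same join column lets Redshift join them without moving rows between nodes, which is usually the main gain for join-heavy dashboards.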
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/
It seems dbt only works for a single database.
If my data is in a different database, will that still work? For example, if my datalake is using delta, but I want to run dbt using Redshift, would dbt still work for this case?
To use dbt, you need to already be able to select from your raw data in your warehouse.
In general, dbt is not an ETL tool:
[dbt] doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform). dbt is the T in ELT. [reference]
So no, you cannot use dbt with Redshift and Deltalake at the same time. Instead, use a separate service to extract and load data into your Redshift cluster — dbt is agnostic about which tool you use to do this.
There is a nuance to this answer - you could use dbt to select from external files in S3 or GCS, so long as you've set up your data warehouse to be able to read those files. For Redshift, this means setting up Redshift Spectrum. (For Snowflake, this means setting up an external table and on BigQuery, you can also query cloud storage data)
So, if the data you read in Deltalake lives in S3 and you set up your Redshift cluster to be able to read it, you can use dbt to transform the data!
You can use Trino with dbt to connect to multiple databases in the same project.
The GitHub example project https://github.com/victorcouste/trino-dbt-demo contains a fully working setup that you can replicate and adapt to your needs.
I would say that DBT doesn't have an execution engine, so you can not use it to move data from one source to another as it isn't processing data itself, it only sends the SQL commands to the database.
In any case, if you want to move data from S3 to Redshift, maybe you could use Redshift Spectrum where you can query S3 as external tables. There you'll be able to use DBT on S3 and Redshift data from the same system.
@willie Chen, the short answer is yes, you can. The more accurate answer is that this is not the intent of dbt. As a tool, it is intended for the transform part of ETL: it transforms data that already exists in a data warehouse. I agree that you should use Redshift Spectrum for ETL.
Luther
I'm creating a (modern) data warehouse in Redshift. All of our infrastructure is hosted at Amazon. So far, I have set up DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly in S3.
Now I must transform and enrich this data from S3 before I can write it to Redshift. Our DW has some tables for facts and dimensions (star schema). Take a Customer dimension: it should contain not only the customer's basic info but also address info, city, state, etc. This data is spread among a few tables in our business database.
So here's my problem: I don't have a clear idea of how to query the S3 staging area in order to join these tables and write the result to my Redshift DW. I want to do it using AWS services like Glue, Kinesis, etc., i.e. fully serverless.
Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift, since all of our data is highly relational in nature anyway? If so, the question remains: how do I transform and enrich the data before saving it to our DW schemas? I have searched everywhere for this particular topic, but information on it is scarce.
Any help is appreciated.
There are a lot of ways to do this but one idea is to query the data using Redshift Spectrum. Spectrum is a way to query S3 (called an external database) using your Redshift cluster.
Really high-level, one way to do this would be to create a Glue Crawler job to crawl your S3 bucket, which creates the External Database that Redshift Spectrum can query.
This way, you don't need to move your data into Redshift itself. Likely, you will want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.
Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
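As a rough illustration of the Spectrum setup, the Python sketch below builds the statement that maps a Glue Data Catalog database into Redshift as an external schema. The schema name, catalog database name, and IAM role ARN are placeholders, and the statements still have to be run against your cluster with whatever SQL client you use.

```python
def build_external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Statement that exposes a Glue Data Catalog database as a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def build_staging_join_sql(schema: str) -> str:
    """Example query joining two crawled staging tables (hypothetical names and columns)."""
    return (
        "SELECT c.customer_id, c.name, a.city, a.state\n"
        f"FROM {schema}.customer AS c\n"
        f"JOIN {schema}.address AS a ON a.address_id = c.address_id;"
    )


# Placeholder names: replace with your own schema, Glue database, and role.
print(build_external_schema_sql(
    "staging", "dms_staging_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
))
print(build_staging_join_sql("staging"))
```

Once the external schema exists, the tables found by the crawler can be queried and joined like ordinary tables, and the result can be inserted into a dimension table inside Redshift.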
To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift but you can do that in a few ways:
1) Get an ETL tool, or set up cron jobs on a server or in Glue that schedule SQL scripts to be run. I do this with a Python script that connects to the database and then runs the SQL text. This would be more of a bulk operation. You can also do this in a Lambda function and schedule it with a CloudWatch trigger, which can be on a cron schedule.
2) Use a Lambda function, triggered by S3 PUTs into that bucket, that runs the SQL script you want. That way the script will run right when the file drops. This would be basically a realtime operation. DMS drops files very quickly, so you will have files dropping multiple times per minute, which might be more difficult to handle.
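A minimal sketch of the Lambda variant could look like the Python below. It reads the bucket and key from the S3 notification event and then hands a SQL statement to a `run_sql` callable. That callable, the `TRANSFORM_SQL` statement, and the stored procedure it calls are all hypothetical; in a real function `run_sql` would wrap your database connection.

```python
from typing import Callable, List, Tuple
from urllib.parse import unquote_plus

# Hypothetical transform statement; replace it with your own SQL script.
TRANSFORM_SQL = "CALL refresh_customer_dim();"


def extract_s3_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context, run_sql: Callable[[str], None] = print):
    """Run the transform once per invocation if at least one file arrived."""
    objects = extract_s3_objects(event)
    if objects:
        run_sql(TRANSFORM_SQL)
    return {"objects": len(objects)}
```

Running the transform once per invocation rather than once per file is a deliberate choice here, since several files can arrive close together. The scheduled variant from option 1 would use the same handler without inspecting the event.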
One option is to load the 'raw' data into Redshift as 'staging' tables. Then, run SQL commands to manipulate the data (JOINs, etc) into the desired format.
Finally, copy the resulting data into the 'public' tables that users query.
This is a normal Extract-Load-Transform process (slightly different to ETL) that uses the capabilities of Redshift to do the transform.
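Here is a small runnable sketch of that load-then-transform flow in Python. It uses the standard library's sqlite3 as a stand-in for Redshift so it can run anywhere, and the staging and dimension table names are made up; against Redshift the same SQL would go through your usual client.

```python
import sqlite3


def load_customer_dim(conn: sqlite3.Connection) -> None:
    """Rebuild the user-facing dimension table from the raw staging tables."""
    with conn:  # one transaction: readers never see a half-built table
        conn.execute("DELETE FROM dim_customer")
        conn.execute(
            """
            INSERT INTO dim_customer (customer_id, name, city, state)
            SELECT c.customer_id, c.name, a.city, a.state
            FROM staging_customer AS c
            JOIN staging_address AS a ON a.address_id = c.address_id
            """
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_customer (customer_id INTEGER, name TEXT, address_id INTEGER);
    CREATE TABLE staging_address (address_id INTEGER, city TEXT, state TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER, name TEXT, city TEXT, state TEXT);
    INSERT INTO staging_customer VALUES (1, 'Ada', 10), (2, 'Linus', 20);
    INSERT INTO staging_address VALUES (10, 'London', 'LND'), (20, 'Portland', 'OR');
    """
)
load_customer_dim(conn)
rows = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
print(rows)
```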
I am trying to find a decent but simple tool that I can host myself on AWS EC2 and that will allow me to pull data out of SQL Server 2005 and push it to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT *, and I just need to put all this data into Redshift. The biggest concern is that there is a lot of data, and this will need to be configurable so I can queue it, run it as a nightly/continuous job, etc.
Any suggestions?
alexeypro,
Dump the tables to files; then you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL server location. See this great article. It is from 2011 but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data and 16 uploader threads were ok, as I'm not on backbone. Key thing to remember is that compression and parallel upload are your friends to cut down the time for upload.
Once the data is on S3, Redshift supports high-performance parallel load from files on S3 into table(s) via the COPY SQL command. To get the fastest load performance, pre-partition your data based on the table distribution key and pre-sort it to avoid expensive vacuums. All of this is well documented in Amazon's best practices. I have to say these guys know how to make things neat and simple, so just follow the steps.
If you are a coder, you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command-line access to Redshift (psql) to launch the COPY command.
Another option is Java; there are libraries for S3 upload and JDBC access to Redshift.
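For reference, the load step boils down to a single COPY statement pointing at an S3 prefix. The Python sketch below only assembles that statement; the table name, bucket, and IAM role ARN are placeholders, and it assumes gzip-compressed CSV files.

```python
def build_redshift_copy(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """COPY statement that loads every gzip-compressed CSV file under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# Placeholder names: all files whose key starts with the prefix are loaded together.
print(build_redshift_copy(
    "sales_view_dump",
    "s3://example-bucket/dumps/sales_view/part_",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
))
```

Because COPY takes a key prefix rather than a single file, splitting the dump into several compressed parts is what lets the load run in parallel.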
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. Took care of automating the process as well as partitioning/presorting the files for load.
SSIS PowerPack is now available, so you can do it natively in SSIS.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift Datawarehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit)
Why SSIS PowerPack?
High performance suite of Custom SSIS tasks, transforms and adapters
With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver which supports a high performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html
For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run dataduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift
Why not Python+boto+psycopg2 script?
It will run on EC2 Windows or Linux instance.
If the OS is Windows, you could:
Extract the data from SQL Server (using sqlcmd.exe).
Compress it (using gzip.GzipFile).
Multipart upload it to S3 (using boto).
Append it to the Amazon Redshift table (using psycopg2).
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader
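As a partial sketch of such a script, the Python below covers only the compression step, splitting the extracted rows into several gzip parts with gzip.GzipFile. The sqlcmd extraction, the boto upload, and the psycopg2 COPY are left out because they need live services, and the sample rows are invented.

```python
import gzip
import io
from typing import List


def compress_in_parts(lines: List[str], lines_per_part: int) -> List[bytes]:
    """Split extracted rows into gzip-compressed parts, ready to upload separately."""
    parts = []
    for start in range(0, len(lines), lines_per_part):
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for line in lines[start:start + lines_per_part]:
                gz.write(line.encode("utf-8") + b"\n")
        # Closing the GzipFile flushes the trailer but leaves the buffer readable.
        parts.append(buf.getvalue())
    return parts


# Invented sample rows standing in for the sqlcmd output.
parts = compress_in_parts(["a,1", "b,2", "c,3"], lines_per_part=2)
print(len(parts))
```

Each part can then be uploaded under a common S3 key prefix, so that a single Redshift COPY picks them all up.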