Transformation in Snowflake or Azure Data Factory?

I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting into Snowflake using COPY INTO. However, I need to enable access for some of the transformed data to other platforms, meaning that if I perform transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
1. Do the transformation in ADF.
2. Do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake).
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL not ADF), cost, and performance.
Any advice would be appreciated.

As stated in the question, there are many possible answers for this scenario - with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best performance. https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
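If you go that route, here is a minimal sketch of unloading transformed data from Snowflake back to Azure Blob Storage; the stage name, container URL, SAS token, and table name are all placeholders:

```sql
-- Hypothetical external stage pointing at the Azure container that the
-- other platforms read from; URL and SAS token are placeholders.
CREATE OR REPLACE STAGE my_azure_stage
  URL = 'azure://myaccount.blob.core.windows.net/shared-container/exports/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>');

-- Unload the transformed result set back to Azure as Parquet files.
COPY INTO @my_azure_stage/transformed/
FROM (SELECT * FROM transformed_table)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE;
```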
For background, see the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3

Related

Replicate data from Cloud SQL Postgres to BigQuery

I am looking for the recommended way of streaming database changes from Cloud SQL (Postgres) to BigQuery. I see that CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres into BigQuery. You can do that in many ways, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the use of the COPY command to move data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement, with SQL functions, or in PL/pgSQL procedures, which gives you a lot of flexibility to extract data as a full dump or incrementally. Be aware that it is a time-consuming process and will require you to invest engineering bandwidth!
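For illustration, here is a minimal sketch of both extraction modes using COPY; the orders table and the updated_at watermark column are hypothetical:

```sql
-- Full dump of a table to a CSV file on the server's file system.
-- (Server-side COPY TO needs file-access privileges; \copy in psql
-- is the client-side equivalent.)
COPY orders TO '/tmp/orders_full.csv' WITH (FORMAT csv, HEADER true);

-- Incremental extract: only rows changed since the last run.
COPY (
    SELECT *
    FROM orders
    WHERE updated_at >= now() - interval '1 day'
) TO '/tmp/orders_incremental.csv' WITH (FORMAT csv, HEADER true);
```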
Also, you could try different tech stacks to implement the above; the one I recommend is Spring Cloud Data Flow (Java).
Way 2:
Using Dataflow
You can automate the ETL process using GCP's Dataflow without coding your own solution. It is faster, but of course it comes at a cost.
Dataflow is "unified stream and batch data processing that's serverless, fast, and cost-effective."
See the Dataflow documentation for more details.

SQL Azure - Cross-database queries

I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now I would like to take the data from table "Table1" in each database and insert it into a common table named Table1Common in a new database, DWHDatabase.
So it's an N-to-1 insert.
How can I do that? I'm trying to solve this with elastic queries, but they seem to be geared to 1-to-1 scenarios.
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also parameterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Cathrine Wilhelmsen
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) can be done on the external source side. It is unsuitable for ETL processes that transfer significant amounts of data from the remote database(s). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
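For completeness, here is a minimal sketch of the external-table pattern elastic query uses, which can still serve the N-to-1 insert when per-database volumes are modest. The server, credential, and table names are all hypothetical, and you would repeat the external data source and external table for each source database:

```sql
-- Run in DWHDatabase. One external data source per source database.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

CREATE DATABASE SCOPED CREDENTIAL ElasticCred
WITH IDENTITY = '<sql-login>', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE SrcDb01
WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'Db01',
    CREDENTIAL = ElasticCred
);

-- Mirrors the schema of Table1 in the remote database.
CREATE EXTERNAL TABLE dbo.Table1_Db01 (
    Id INT,
    Payload NVARCHAR(200)
)
WITH (DATA_SOURCE = SrcDb01, SCHEMA_NAME = 'dbo', OBJECT_NAME = 'Table1');

-- N-to-1 load: repeat per source database (or generate dynamically).
INSERT INTO dbo.Table1Common (Id, Payload)
SELECT Id, Payload FROM dbo.Table1_Db01;
```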
You may use the Copy activity to copy data between on-premises and cloud-based data stores. After you've copied the data, you can use other activities to transform and analyze it. The Copy activity can also be used to publish transformation and analysis results for business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.

Tableau visualization - Performance issue with huge data

I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also event data available in Kafka. We use Tableau for analytics and are facing performance issues with the huge data volumes. So I'm planning to store the data in some other way and still use Tableau for visualization. I have multiple options now and need some help to finalize the approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
I'm not sure whether any of the above approaches will also be a good solution for streaming data (Kafka) analytics.
Note:- I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore that stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB scale). The main difference is the pricing model (Athena is serverless and priced per data processed per request, whereas Spark may imply infrastructure cost).
That said, both options may not perform well, as they are not OLAP systems designed for self-service BI (they are better used for ad hoc queries over huge data).
You may also have trouble managing your data model using flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift, which is designed to support OLAP systems. You can connect Tableau directly to Redshift, but you'll suffer from latency, and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
If you then have performance issues, you'll be able to create Tableau data source extracts from Redshift later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it, which avoids running the same query on Redshift each time a dashboard is opened in Tableau (though Redshift also caches queries).
Finally, as you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys, as sketched below.
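As an illustration, here is what that tuning might look like; the table and column names are hypothetical:

```sql
-- Distribute both tables on the join key so matching rows live on the
-- same slice and the join avoids network redistribution.
CREATE TABLE dim_customers (
    customer_id BIGINT,
    name        VARCHAR(200)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE fact_events (
    event_id    BIGINT,
    customer_id BIGINT,
    event_ts    TIMESTAMP,
    payload     VARCHAR(500)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (event_ts);   -- speeds up time-range filters common in dashboards
```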
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
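A minimal sketch of such a Spectrum setup, assuming a Glue catalog; the database name, IAM role ARN, and S3 path are hypothetical:

```sql
-- Expose a Glue/Athena catalog database as an external schema in Redshift.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Register the Parquet files as an external table.
CREATE EXTERNAL TABLE spectrum.raw_events (
    customer_id BIGINT,
    event_ts    TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/events/';

-- External tables can be joined directly with local Redshift tables.
SELECT c.name, COUNT(*) AS event_count
FROM spectrum.raw_events e
JOIN dim_customers c ON c.customer_id = e.customer_id
GROUP BY c.name;
```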
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

General question about ETL solutions for Azure for a small operation

The way we use data is either retrieving survey data from other organizations, or creating survey instruments ourselves and soliciting organizations under our organization for data.
We have a database where our largest table is perhaps 10 million records. We extract and upload most of our data on an annual basis, with occasionally needing to ETL over large numbers of tables from organizations such as the Census, American Community Survey, etc. Our database is all on Azure and currently the way that I get databases from Census flat files/.csv files is by re-saving them as Excel and using the Excel import wizard.
All of the 'T' in ETL is happening within programmed procedures within my staging database before moving those tables (using Visual Studio) to our reporting database.
Is there a more sophisticated technology I should be using, and if so, what is it? All of my education in this matter comes from perusing Google and watching YouTube, so my grasp on all of the different terminology is lacking and searching on the internet for ETL is making it difficult to get to what I believe should be a simple answer.
For a while I thought we wanted to eventually graduate to using SSIS, but I learned that SSIS is primarily used when you have an on-prem database. I've tried looking at dynamic SQL using BULK INSERT, only to find that BULK INSERT doesn't work with Azure DBs. Etc.
Recently I've been learning about Azure Data Factory and the Bulk Copy Program (bcp) used from Windows PowerShell.
Does anybody have any suggestions as to what technology I should look at for a small-scale BI reporting solution?
I suggest you use Data Factory; it has good performance for large data transfers.
Reference here: Copy performance and scalability achievable using ADF
The Copy activity lets you use a table, a query, or a stored procedure to filter the data at the source.
The sink lets you select a destination table, a stored procedure, or auto-create the table (bulk insert) to receive the data.
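For example, here is a sketch of the kind of source-side filtering procedure a Copy activity could call; the procedure, table, and watermark column are hypothetical:

```sql
-- T-SQL: a filtering procedure the Copy activity source can invoke,
-- passing the last watermark value as a parameter from the pipeline.
CREATE PROCEDURE dbo.GetChangedRows
    @Since DATETIME2
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Id, Payload, ModifiedAt
    FROM dbo.SourceTable
    WHERE ModifiedAt >= @Since;
END;
```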
Data Factory Mapping Data Flow provides more features for data transformation.
Ref: Copy and transform data in Azure SQL Database by using Azure Data Factory.
Hope this helps.

Access Azure Data Lake Analytics Tables from SQL Server Polybase

I need to export a multi-terabyte dataset processed via Azure Data Lake Analytics (ADLA) onto a SQL Server database.
Based on my research so far, I know that I can write the result of (ADLA) output to a Data Lake store or WASB using built-in outputters, and then read the output data from SQL server using Polybase.
However, creating the result of ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution (no files to manage), multiple readers, built-in partitioning, distribution keys and the potential for allowing other processes to access the tables.
If we use ADLA tables, can I access ADLA tables via SQL Polybase? If not, is there any way to access the files underlying the ADLA tables directly from Polybase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, Polybase support for ADLA Tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse User voice page.
The suggested workaround is to produce the output as CSV files in ADLA, then create the partitioned and distributed table in SQL DW and use Polybase to read the data and fill the SQL DW managed table.
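A minimal sketch of that workaround on the SQL DW side, assuming the CSV output is already in the Data Lake Store; all object names, the storage path, and the credential are hypothetical:

```sql
-- External data source over the Data Lake Store location holding the
-- ADLA CSV output (AdlsCredential is a pre-created database scoped
-- credential for the store).
CREATE EXTERNAL DATA SOURCE AdlaOutput
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://mydatalakestore.azuredatalakestore.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

-- External table over the CSV files produced by ADLA.
CREATE EXTERNAL TABLE ext.AdlaResults (
    Id     BIGINT,
    Metric FLOAT
)
WITH (LOCATION = '/results/', DATA_SOURCE = AdlaOutput, FILE_FORMAT = CsvFormat);

-- Load into a distributed, managed SQL DW table via CTAS.
CREATE TABLE dbo.AdlaResults
WITH (DISTRIBUTION = HASH(Id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.AdlaResults;
```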