Can dbt connect to different databases in the same project? - dbt

It seems dbt only works for a single database.
If my data is in a different database, will that still work? For example, if my datalake is using delta, but I want to run dbt using Redshift, would dbt still work for this case?

To use dbt, you need to already be able to select from your raw data in your warehouse.
In general, dbt is not an ETL tool:
[dbt] doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform). dbt is the T in ELT. [reference]
So no, you cannot use dbt with Redshift and Deltalake at the same time. Instead, use a separate service to extract and load data into your Redshift cluster — dbt is agnostic about which tool you use to do this.
There is a nuance to this answer - you could use dbt to select from external files in S3 or GCS, so long as you've set up your data warehouse to be able to read those files. For Redshift, this means setting up Redshift Spectrum. (For Snowflake, this means setting up an external table and on BigQuery, you can also query cloud storage data)
So, if the data you read in Deltalake lives in S3, if you set up your Redshift cluster to be able to read it, you can use dbt to transform the data!

You can use Trino with dbt to connect to multiple databases in the same project.
The Github example project https://github.com/victorcouste/trino-dbt-demo contains a fully working setup, that you can replicate and adapt to your needs.

I would say that DBT doesn't have an execution engine, so you can not use it to move data from one source to another as it isn't processing data itself, it only sends the SQL commands to the database.
In any case, if you want to move data from S3 to Redshift, maybe you could use Redshift Spectrum where you can query S3 as external tables. There you'll be able to use DBT on S3 and Redshift data from the same system.

#willie Chen the short answer is yes you can. The more accurate answer that is not the intent of dbt. As a tool it is intended for the transform part of ETL. It serves as a transform that is already existing in a data warehouse. I agree that you should use Redshift Spectrum for ETL.
Luther

Related

BigQuery to BigQuery DataFlow

I've had a look at this SO post but it's three years old and I think GCP has changed since then.
What I'm trying to do is set up a data pipeline using DataFlow jobs to copy/transform data from one GBQ project into another GBQ project.
To create a DataFlow job, you need to choose a template and there is no template that matches my needs i.e. no BQ to BQ template.
There is an option to use a custom template (which I imagine would be a python script or something along those lines), but it seems odd that there is no BQ to BQ template. Is DataFlow not the right tool for this job? Should I just use scheduled queries?
Thanks in advance
There is a way which is not very straight forward if you really want to use Dataflow template, you can use BigQuery to cloud storage template to store data in GCS and then cloud storage to BigQuery template to bring the data to destination project. However make sure you gave proper permission that is required to access the cloud storage buckets from the destination project.
If the transformations you want are not possible using SQL or not practical to use SQL, you can use Cloud Data fusion -> Integration studio. Here you can choose both source and sink as BigQuery and there are a number of options available for transformation component. It is similar to ETL tool. Data Fusion Quickstart documentation.
Otherwise, you can simply execute or schedule a query as per your requirement in BigQuery itself and save the result of the query in another table Saving query results in destination table.

Load daily MySQL DB snapshots from S3 to snowflake

I have daily MySQL DB snapshots stored on S3. This daily DB snapshot is a backup of 1000 tables in our DB, using mysqldump, size is about 300M daily (stored 1 year of snapshots, which is about 110G).
Now we want to load these snapshots daily to snowflake for reporting purpose. How do we create tables in snowflake? Shall we create 1000 tables? Will snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce you metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics or which will be accessed by BI tools (including direct SQL) I would hold in Snowflake - otherwise you wont get the performance that Snowflake can deliver and there is little point using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
Hope this helps?

De-Identifying PHI For HIPAA

I have a SQL DB which contains PHI, hosted on AWS. I want to access this data to perform analytics, however, I must de-identify the data first to comply with HIPAA.
How should I approach this? I have thought of a few approaches:
Simply de-identify the DB with SQL commands.
From now on, every time the DB is added to, add a de-identified version of that data to another DB. Then access this DB for analytics.
From now on, every time the DB is added to, add a de-identified version of that data to another table in that DB. Then access this table with SQL commands for analytics.
Which is the best approach to use to maintain compliance with HIPAA? Or, is there a better way?
Thanks!
Budget allowing, consider doing your analytics on a different system and during the ETL, de-identify the data. Changing the source system to accommodate this requirement will increase complexity to maintain and likely affect other integrations - might end up with a monolith.
There's various ways to do this: You could do a AWS DMS (with ongoing replication) with the DB as your source and S3 as target (parquet format). From there you could use Athena for analytics as jarmod highlighted, which also supports parquet format and you can even use SQL-like queries in Athena to analyze your data. There's also Redshift, send to another Relational DB, other analytics platforms etc.

How to transform data from S3 bucket before writing to Redshift DW?

I'm creating a (modern) data warehouse in redshift. All of our infrastructure is hosted at Amazon. So far, I have setup DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly to S3.
Now I must transform and enrich this data from the S3 before I can write it to Redshift. Our DW have some tables for facts and dimensions (star schema), so, imagine a Customer dimension, it should contain not only the customer basic info, but address info, city, state, etc. This data is spread amongst a few tables in our business database.
So here's my problem, I don't have a clear idea of how to query the S3 staging area in order to join these tables and write it to my redshift DW. I want to do it using AWS services like Glue, Kinesis, etc. i.e. full serverless.
Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift since all of our data is highly relational in nature anyway? If so, the question remains, how to transform/enrich data before saving it on our DW schemas? I have searched everywhere for this particular topic but information on it is scarse.
Any help is appreciated.
There are a lot of ways to do this but one idea is to query the data using Redshift Spectrum. Spectrum is a way to query S3 (called an external database) using your Redshift cluster.
Really high-level, one way to do this would be to create a Glue Crawler job to crawl your S3 bucket, which creates the External Database that Redshift Spectrum can query.
This way, you don't need to move your data into Redshift itself. Likely, you will want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.
Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift but you can do that in a few ways:
1) Get an ETL tool or set up CRON jobs on a server or Glue that schedules SQL scripts to be ran. I do this with a Python script that connects to the database then runs the SQL text. This would be a little bit more of a bulk operation. You can also do this in a Lambda function and have it be scheduled on a Cloudwatch trigger which can be on a cron schedule
2) Use a Lambda function that runs the SQL script that you want that triggers for S3 PUTs into that bucket. That way the script will run right when the file drops. This would be basically a realtime operation. DMS drops files very quickly so you will have files dropping multiple times per minute so that might be more difficult to handle.
One option is to load the 'raw' data into Redshift as 'staging' tables. Then, run SQL commands to manipulate the data (JOINs, etc) into the desired format.
Finally, copy the resulting data into the 'public' tables that users query.
This is a normal Extract-Load-Transform process (slightly different to ETL) that uses the capabilities of Redshift to do the transform.

Presto and Hive

I'm trying to enable basic SQL querying of CSV files located in an s3 directory. Presto seemed like a natural fit (the files are 10s GB). As I went through the setup in Presto, I tried creating a table using the Hive connector. It was not clear to me if I only needed the hive metastore to save my table configurations in Presto, or if I have to create them in there first.
The documentation makes it seem that you can use Presto without having to CONFIGURE Hive, but using Hive syntax. Is that accurate? My experiences are that AWS S3 has not been able to connect.
Presto syntax is similar to Hive syntax. For most simple queries, the identical syntax would function in both. However, there are some key differences that make Presto and Hive not entirely the same thing. For example, in Hive, you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such examples of nuanced syntactical differences between the two.
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only distributed execution engine. However, it lacks metadata information about tables. Thus, Presto Coordinator needs Hive to retrieve table metadata to parse and execute a query.
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option, in recent 0.198 release Presto adds a capability to connect AWS Glue and retrieve table metadata on top of files in S3.
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query/process data living in S3/CSV formats.
Also, I'm curious: what solution did you end up implementing to resolve your issue?