Declaring multiple warehouses in dbt

I am pretty new to dbt. I want to use two kinds of warehouses in one project. Currently I have declared my ClickHouse warehouse, which I am going to build tables in, and I need to add another warehouse, MindsDB, because I want to reference some of the tables in it.
Currently my profiles.yml looks like this:
dbt-project:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
I want to add the warehouse below too:
type: mysql
host: mysql.mindsdb.com
user: mindsdb.user#example.com
password: xxx
port: 3306
dbname: mindsdb
schema: exampl_xxx
threads: 1
Is there a way to do this? Thank you.

This is a bit outside what dbt is designed to do. Is there any reason you can't use multiple projects with their own deployments? Presumably the models have dependencies on each other?
If I had to do this I would:
1. Create two targets (or sets of targets), one for each warehouse (or dev/prod for each warehouse, etc.); a sketch follows after this answer
2. Create a custom materialization that wraps the typical table materialization, but no-ops if target.type does not match a specified adapter
3. Run the project against each adapter in series, in a shell script
4. Use tags to select the parts of the DAG that are up/downstream of the other adapter's models
I think the core of the problem is that dbt run needs a database connection to compile your project, so you really can't run against two databases simultaneously. What I described above is not really any better than having two standalone projects. You may want to look into using an orchestrator, like Airflow, Dagster, or Prefect.
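To make steps 1 and 3 concrete, here is a rough sketch of the two-target profiles.yml and the serial run. The target names, the MindsDB output block, and the tag names are assumptions for illustration, not a tested setup:

# ~/.dbt/profiles.yml (sketch)
dbt-project:
  target: clickhouse_dev
  outputs:
    clickhouse_dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
    mindsdb_dev:
      type: mysql
      host: mysql.mindsdb.com
      user: mindsdb.user#example.com
      password: xxx
      port: 3306
      dbname: mindsdb
      schema: exampl_xxx
      threads: 1

# run.sh (sketch): run each adapter's slice of the DAG in series,
# assuming models are tagged 'clickhouse' or 'mindsdb'
dbt run --target clickhouse_dev --select tag:clickhouse
dbt run --target mindsdb_dev --select tag:mindsdb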

Related

'DBT docs generate' does not populate model column-level comments in the catalog

I use dbt-snowflake 1.1.0 with the corresponding dbt-core 1.1.0.
I added documentation for my models in YAML files, e.g.:
> models/stage/docs.yml
version: 2
models:
  - name: raw_weblogs
    description: Web logs of customer interaction, broken down by attribute (bronze). The breakdown is performed using regular expressions.
    columns:
      - name: ip
        description: IP address from which the request reached the server (might be direct customer IP or the address of a VPN/proxy).
...
These details show up correctly in the dbt UI when I run dbt docs generate and then dbt docs serve, yet they are not listed in target/catalog.json:
cat target/catalog.json | grep identity
(no results)
According to the DBT documentation, I understand that column comments should be part of catalog.json.
LATER EDIT: I tried running dbt --debug docs generate and it seems that all data is retrieved directly from the target environment (in my case, Snowflake). Looking at the columns of my model in Snowflake, they indeed do NOT have any comments posted on them in Snowflake.
It thus seems to me that the underlying problem is that dbt run does not persist the column metadata to Snowflake.
After further investigation, I found that this was indeed the reason for the missing comments: dbt docs generate writes comments to catalog.json based on what it receives from the database, while dbt docs serve populates the UI by combining catalog.json with metadata (in my case, the documented column comments) from the local dbt models.
The solution to persist such metadata in the database with dbt run was to add the following DBT configuration:
> dbt_project.yml
models:
  <project>:
    ...
    +persist_docs:
      relation: true
      columns: true
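If you prefer to scope this to a single model rather than the whole project, the same setting can also go in the model's config block. A minimal sketch, where the model body is just a placeholder:

-- models/stage/raw_weblogs.sql (sketch)
{{ config(
    persist_docs={"relation": true, "columns": true}
) }}

select 1 as ip  -- placeholder body; the ip column matches the documented column above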

How to best implement dynamic dbt datasets

I'm cleaning up a dbt + BigQuery environment and trying to implement a staging environment that pulls from a staging dataset. Problem is that the current .yml files with source information all explicitly point to a production dataset.
One option that I am considering is a source wrapper function that will serve as an adapter and inject the proper dataset depending on some passed CLI var or profile target (which is different for the staging vs prod environments).
However, I'm fairly new to dbt so unsure if this is the best way to go about this. Would appreciate any insight you kind folks have :)
EDIT: I'm realizing that a source wrapper is not the way to go as it would mess with the generated DAG
You can supply the name of the schema for a source in a variable or environment variable, and set that variable at runtime.
In your sources.yml:
version: 2
sources:
  - name: jaffle_shop
    schema: "{{ var('source_jaffle_shop_schema') }}"
    tables:
      - name: orders
In your dbt_project.yml:
vars:
  source_jaffle_shop_schema: MY_DEFAULT_SCHEMA
And then to override at runtime:
dbt run --vars "{source_jaffle_shop_schema: my_other_schema}"
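If you would rather drive this from an environment variable (the other option mentioned above), a small sketch using env_var with a default; the variable name DBT_JAFFLE_SHOP_SCHEMA is an assumption:

sources:
  - name: jaffle_shop
    schema: "{{ env_var('DBT_JAFFLE_SHOP_SCHEMA', 'MY_DEFAULT_SCHEMA') }}"
    tables:
      - name: orders

And at runtime:
DBT_JAFFLE_SHOP_SCHEMA=my_other_schema dbt run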

dbt depends on a source not found

Could you please help me with this issue?
Encountered an error:
Compilation Error in model metrics_model (models\example\metrics_model.sql)
Model 'model.test_project.metrics_model' (models\example\metrics_model.sql) depends on a source named 'automate.metrics' which was not found
I keep getting this error and have not been able to solve it.
Many thanks beforehand!
This is due to the automate.metrics table missing from the database (either the dbt project’s target database or a different database on the same server). There should be a source.yml or automate.yml file somewhere in your project that defines the source. FYI automate is the schema name and metrics is the table name.
If the source yml file specifies a database for the automate schema, query that database to make sure that the metrics table exists in the automate schema.
If the source yml file doesn’t list a database, then that schema / table should exist in the dbt project’s target database. You can see what the target database is by looking at the profile for your project setup in ~/.dbt/profiles.yml.
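For illustration only, the source definition that dbt is looking for would be shaped roughly like this; the file name and the optional database key are assumptions to adapt to your project:

models/example/automate.yml (sketch)
version: 2
sources:
  - name: automate
    schema: automate
    # database: your_database  # only needed if the schema lives outside the target database
    tables:
      - name: metrics

metrics_model would then reference it as {{ source('automate', 'metrics') }}.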
For a PostgreSQL database, please check that the sources.yml file is defined as follows:
version: 2
sources:
  - name: name_of_the_source
    schema: name_of_the_schema
    quoting:
      database: false
      schema: false
      identifier: false
    loader: stitch
    tables:
      - name: name_of_table1
      - name: name_of_table2
Are you seeing this in your dev environment? It's possible that you've not run dbt run after creating the automate.metrics table, which is preventing metrics_model from referencing it.
Check whether you put the source config in the right YAML file. I encountered this issue and tried every solution, including the one above. In the end it turned out I had forgotten the .yml suffix on the source file, so dbt couldn't locate the source config in that file.

Getting data from different database on different server with one SQL Server query

Server1: Prod, hosting DB1
Server2: Dev, hosting DB2
Is there a way to query databases living on two different servers with the same SELECT query? I need to bring all the new rows from Prod to Dev, using a query like the one below. I will be using SQL Server DTS (the import/export data utility) to do this.
Insert into Dev.db1.table1
Select *
from Prod.db1.table1
where table1.PK not in (Select table1.PK from Dev.db1.table1)
Creating a linked server is the only approach that I am aware of for this. If you are simply trying to add all new rows from Prod to Dev, why not just create a backup of that one particular table, pull it into the Dev environment, and then write the query against the same server and database?
Granted, this is a one-time use and a pain for recurring instances, but if it is a one-time thing then I would recommend doing that. Otherwise, create a linked server between the two.
To back up a single table in SQL Server, use the SQL Server Import and Export Wizard. Select the Prod database as your data source, select only the Prod table as your source table, and make a new table in the Dev environment as your destination table.
This should get you what you are looking for.
You say you're using DTS; the modern equivalent would be SSIS.
Typically you'd use a data flow task in an SSIS package to pull all the information from the live system into a staging table on the target, then load it from there. This is a pretty standard operation when data warehousing.
There are plenty of different approaches to save you from copying all the data across (e.g. use a timestamp, use rowversion, use Change Data Capture, or make use of the fact that your primary key only ever gets bigger). Or you could just do what you want with a lookup flow directly in SSIS...
The best approach will depend on many things: how much data you've got, what data transfer speed you have between the servers, your key types, etc.
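For illustration, assuming the primary key really is ever-increasing and the Prod server is reachable by four-part name (e.g. via a linked server), a delta pull along those lines might look like the sketch below; all object names are placeholders:

-- Pull only rows whose key is higher than anything already in Dev.
-- Run this on the Dev server; drop the Dev prefix if the target is local.
INSERT INTO Dev.db1.dbo.table1
SELECT p.*
FROM Prod.db1.dbo.table1 AS p
WHERE p.PK > (SELECT ISNULL(MAX(t.PK), 0) FROM Dev.db1.dbo.table1 AS t);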
When your servers are all in one Active Directory domain and you use Windows Authentication, then all you need is an account which has the proper rights on all the databases!
You can then simply reference all tables as server.database.schema.table.
For example:
insert into server1.db1.dbo.tblData1 (...)
select ... from server2.db2.dbo.tblData2;
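If the remote server is not already registered, a minimal sketch of creating the linked server first; the server name and login mapping are placeholders and your security setup may differ:

-- On server1, register server2 as a linked server
EXEC sp_addlinkedserver @server = N'server2', @srvproduct = N'SQL Server';
-- With Windows Authentication, reuse the caller's security context
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'server2', @useself = 'TRUE';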

Copying Vertica Schema or all tables in a schema from one physical cluster to another physical Cluster

I am trying to export and import Vertica schema from one physical cluster to another physical cluster.
My Test instance has one single cluster and my production instance has 3 clusters.
I explored the following options, but they are limited to moving data within one physical Vertica instance:
EXPORT TO VERTICA ...
COPY schema.table FROM VERTICA ...
I would like to know if there is an option to move the Vertica schema from one physical Vertica instance to another with a different cluster configuration.
This is a tricky manipulation, which has many issues:
- If you copy over DDLs, you will lose the current value of sequences, which might mean duplicate primary keys when you insert data.
- If columns are set up as AUTO_INCREMENT, you will not be able to insert data into them as it is on the source (you cannot force an auto_increment column, although I believe this might have been fixed in newer releases).
- If you copy DDLs between clusters with a different number of nodes, and node names are part of the projection definitions, you will end up with something you do not want.
- As you noticed, different networks will prevent the use of CONNECT.
An attempt to help with this has been made in Python via the pyvertica utility, and especially the vertica_migrate script. You can find the docs at https://pyvertica.readthedocs.org .
This is a tricky job, and I know there are some issues in this script, although it has already helped me a lot.
Hope this helps,
You can use either COPY FROM VERTICA or EXPORT TO VERTICA to import/export the data to another Vertica database (regardless of node configuration). The target table must already exist. You can use EXPORT_OBJECTS to export the DDL. Both methods allow for data migration from a version that is one major release earlier (e.g. if you are running 6.x, you can import from 5.x).
In the example below, I'll use EXPORT TO VERTICA to export data from the source database to the target database.
You must first create a connection to the other database:
CONNECT TO VERTICA VMart USER dbadmin PASSWORD '' ON 'VerticaTarget',5433;
Then use EXPORT TO VERTICA to export the data from the source to the target database:
EXPORT TO VERTICA VMart.schema.customer_dimension FROM schema.customer_dimension;
Here VMart.schema.customer_dimension is the target (on the connected database) and schema.customer_dimension is the source.
DISCONNECT VMart;
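For the DDL side mentioned above, a rough sketch using EXPORT_OBJECTS; the file path and schema name are placeholders, and the generated script should be reviewed (node-specific projection definitions in particular) before running it on the target cluster:

-- Write the DDL for every object in the schema to a file on the initiator node
-- (pass an empty string as the path to print the script to stdout instead).
SELECT EXPORT_OBJECTS('/tmp/schema_ddl.sql', 'my_schema', false);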