DBT select Big Query table from different Google Project - google-bigquery

I am using dbt to read and write tables in BigQuery, all running in my Google project X.
I have one table which I want to read in from a different Google project Y and use in a dbt model (which will then be saved as a table in project X).
Is this possible? And if so, where do I define the other project in FROM {{ source('dataset_project_y', 'table_to_read') }}?

First, you need to declare the source in a source.yml file.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#declaring-a-source
For example, create a source_y.yml:
version: 2

sources:
  - name: dataset_project_y
    schema: dataset_y
    database: 'project_y'
    tables:
      - name: table_to_read
        identifier: table_to_read
After that, you can refer to the source table_to_read in any dbt model and select from it in that model's SQL statements.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#selecting-from-a-source
For example, to use table_to_read in dbt_model_x.sql:
{{
  config(
    materialized = "view"
  )
}}

SELECT * FROM {{ source('dataset_project_y', 'table_to_read') }}
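Because the source declares a database, the source() call above compiles to a fully qualified BigQuery reference, roughly:

SELECT * FROM `project_y`.`dataset_y`.`table_to_read`

Also note that the credentials dbt uses in project X must have BigQuery read permission (e.g. the BigQuery Data Viewer role) on that dataset in project Y, otherwise the cross-project read will fail.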

Related

Creation of Table using DBT

Can we create a new table in DBT?
Can we copy the structure of a table that exists in the dev environment's database to another environment using dbt?
Yes. However, dbt needs a "reason" to create tables, for example to materialize the data produced by one of its models. dbt cannot create a table just for creation's sake.
Well, strictly speaking, you can do this by putting CREATE TABLE ... in a pre-hook or post-hook, but I suppose this is not what you want, since dbt adds nothing here over running the SQL yourself.
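For illustration, such a hook could look like the sketch below; the dataset and table names are made up, and the DDL is BigQuery-flavoured:

{{
  config(
    materialized = "view",
    pre_hook = "create table if not exists my_dataset.my_manual_table (id int64, name string)"
  )
}}

select 1 as id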
You can define your existing table as a source, where you can set a database, schema and table name different from the target location where dbt writes data, and then define a model something like this:
{{ config(materialized="table") }}

select *
from {{ source('your_source', 'existed_table_name') }}
limit 1 /* add "limit 1" if you only want the structure */
Put the necessary connection credentials in profiles.yml and build the model. dbt will copy one row from the source table into the model's table, and the model table itself is created for free along the way.
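For completeness, a minimal sketch of the matching source definition; your_source, some_other_database, some_schema and existed_table_name are placeholders:

version: 2

sources:
  - name: your_source
    database: some_other_database
    schema: some_schema
    tables:
      - name: existed_table_name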

How to best implement dynamic dbt datasets

I'm cleaning up a dbt + BigQuery environment and trying to implement a staging environment that pulls from a staging dataset. Problem is that the current .yml files with source information all explicitly point to a production dataset.
One option that I am considering is a source wrapper function that will serve as an adapter and inject the proper dataset depending on some passed CLI var or profile target (which is different for the staging vs prod environments).
However, I'm fairly new to dbt so unsure if this is the best way to go about this. Would appreciate any insight you kind folks have :)
EDIT: I'm realizing that a source wrapper is not the way to go as it would mess with the generated DAG
You can supply the name of the schema for a source in a variable or environment variable, and set that variable at runtime.
In your sources.yml:
version: 2

sources:
  - name: jaffle_shop
    schema: "{{ var('source_jaffle_shop_schema') }}"
    tables:
      - name: orders
In your dbt_project.yml:
vars:
  source_jaffle_shop_schema: MY_DEFAULT_SCHEMA
And then to override at runtime:
dbt run --vars "{source_jaffle_shop_schema: my_other_schema}"
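Alternatively, since environment variables were mentioned, the same idea works with dbt's env_var() function, whose second argument is the default used when the variable is not set (the variable name here is just an example):

sources:
  - name: jaffle_shop
    schema: "{{ env_var('DBT_JAFFLE_SHOP_SCHEMA', 'MY_DEFAULT_SCHEMA') }}"
    tables:
      - name: orders

And at runtime: DBT_JAFFLE_SHOP_SCHEMA=my_other_schema dbt run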

Declaring multiple warehouses in dbt

I am pretty new to dbt. I want to use two kinds of warehouses in one project. Currently I have declared my ClickHouse warehouse, which I am going to build tables in, and I need to add another warehouse, MindsDB, because I want to reference some of its tables.
Currently my profiles.yml looks like this:
dbt-project:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
I want to add the warehouse below too:
type: mysql
host: mysql.mindsdb.com
user: mindsdb.user#example.com
password: xxx
port: 3306
dbname: mindsdb
schema: exampl_xxx
threads: 1
Is there a way to do it? Thank you.
This is a bit outside what dbt is designed to do. Is there any reason you can't use multiple projects with their own deployments? Presumably the models have dependencies on each other?
If I had to do this I would:
Create two targets (or sets of targets), one for each warehouse (or dev/prod for each warehouse, etc.)
Create a custom materialization that wraps the typical table materialization, but no-ops if target.type does not match a specified adapter
Run the project on each adapter in series, in a shell script
Use tags to select parts of the DAG that are up/downstream from the other adapter's models
I think the core of the problem is that dbt run needs a database connection to compile your project, so you really can't run against two databases simultaneously. What I described above is not really any better than having two standalone projects. You may want to look into using an orchestrator, like Airflow, Dagster, or Prefect.
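To illustrate the "two targets" idea with the asker's own settings, profiles.yml could hold one output per warehouse (this assumes a MySQL-compatible dbt adapter such as dbt-mysql is installed; the exact connection keys depend on that adapter):

dbt-project:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
    mindsdb:
      type: mysql
      host: mysql.mindsdb.com
      user: mindsdb.user#example.com
      password: xxx
      port: 3306
      dbname: mindsdb
      schema: exampl_xxx
      threads: 1

Each part of the project would then be run separately, e.g. dbt run --target dev --select tag:clickhouse followed by dbt run --target mindsdb --select tag:mindsdb.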

dbt depends on a source not found

Could you please help me with this issue?
Encountered an error:
Compilation Error in model metrics_model (models\example\metrics_model.sql)
Model 'model.test_project.metrics_model' (models\example\metrics_model.sql) depends on a source named 'automate.metrics' which was not found
I keep getting this error and have not been able to solve it.
Many thanks beforehand!
This is due to the automate.metrics table missing from the database (either the dbt project’s target database or a different database on the same server). There should be a source.yml or automate.yml file somewhere in your project that defines the source. FYI automate is the schema name and metrics is the table name.
If the source yml file specifies a database for the automate schema, query that database to make sure that the metrics table exists in the automate schema.
If the source yml file doesn’t list a database, then that schema / table should exist in the dbt project’s target database. You can see what the target database is by looking at the profile for your project setup in ~/.dbt/profiles.yml.
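For reference, a source definition matching that error would look roughly like this (the file name, database and schema are up to your project):

version: 2

sources:
  - name: automate
    schema: automate
    tables:
      - name: metrics

and the model would then reference it as {{ source('automate', 'metrics') }}.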
For a PostgreSQL database, please check that the sources.yml file is defined as follows:
version: 2

sources:
  - name: name_of_the_source
    schema: name_of_the_schema
    quoting:
      database: false
      schema: false
      identifier: false
    loader: stitch
    tables:
      - name: name_of_table1
      - name: name_of_table2
Are you seeing this in your dev environment? It's possible that you haven't run dbt run after creating the automate.metrics table, which is preventing metrics_model from referencing it.
Check whether you put the source config in the right YAML file. I encountered this issue and tried every solution, including the one above. It finally turned out I had forgotten the .yml suffix on the source file, so dbt could not locate the source config in that file.

dbt partitions bigquery table on wrong field

Trying to add data build tool (dbt) to our ecosystem, I ran into the problem in the title. We are currently using huge BigQuery tables, so we want them to be partitioned and extended every day. I don't know if it matters, but everything is running in a Docker container. Here's how you can reproduce it:
BigQuery SQL query to create the source table:
create table `***.dbt_nick_test.partition_test_20210304` (
  session_date DATE,
  user_id STRING
);

insert into `***.dbt_nick_test.partition_test_20210304` (session_date, user_id)
values ('2021-03-04', '1234'), ('2021-03-04', NULL), ('2021-03-04', '1235');
dbt_project.yml - models definition part:
models:
  ***:
    test:
      +schema: test
profiles.yml - just in case to be sure everything is configured ok:
***-bq:
  target: "{{ env_var('DBT_TARGET', 'dev') }}"
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: ***
      dataset: dbt_nick_test
      threads: 4
      keyfile: /root/.dbt/bq-creds.json
      timeout_seconds: 300
      priority: interactive
      retries: 1
cat models/test/test.sql:
{{
  config(
    partition_by={
      "field": "session_date",
      "data_type": "date",
      "granularity": "day"
    },
    partitions=dbt.partition_range(var('dates', default=yesterday())),
    verbose=True
  )
}}

SELECT
  session_date,
  user_id
FROM `***`.`dbt_nick_test`.`{{ date_sharded_table('partition_test_') }}`
The yesterday macro is the default one from the dbt tutorial.
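For context, a macro like that typically just returns yesterday's date as a string matching the shard suffix; a minimal sketch (the tutorial's exact implementation may differ):

{% macro yesterday() %}
  {# modules.datetime is exposed in dbt's Jinja context #}
  {% set yday = modules.datetime.date.today() - modules.datetime.timedelta(days=1) %}
  {{ return(yday.strftime('%Y%m%d')) }}
{% endmacro %}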
After running dbt -dS run -m test --vars 'dates: "20210304, 20210304"' (everything goes OK), dbt reports that the table is created successfully. Going to BigQuery, I can see that the table is indeed created, but it has the wrong "partition by" field: _PARTITIONTIME instead of session_date (see screenshot).
If I manually create a correctly partitioned table and then run dbt run, it works as expected; everything is perfect.
Also, tables created from this table using dbt are incorrectly partitioned as well.