I'm cleaning up a dbt + BigQuery environment and trying to implement a staging environment that pulls from a staging dataset. Problem is that the current .yml files with source information all explicitly point to a production dataset.
One option that I am considering is a source wrapper function that will serve as an adapter and inject the proper dataset depending on some passed CLI var or profile target (which is different for the staging vs prod environments).
However, I'm fairly new to dbt so unsure if this is the best way to go about this. Would appreciate any insight you kind folks have :)
EDIT: I'm realizing that a source wrapper is not the way to go as it would mess with the generated DAG
You can supply the name of the schema for a source in a variable or environment variable, and set that variable at runtime.
In your sources.yml:
version: 2

sources:
  - name: jaffle_shop
    schema: "{{ var('source_jaffle_shop_schema') }}"
    tables:
      - name: orders
In your dbt_project.yml:
vars:
  source_jaffle_shop_schema: MY_DEFAULT_SCHEMA
And then to override at runtime:
dbt run --vars "{source_jaffle_shop_schema: my_other_schema}"
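If you'd rather not pass --vars on every invocation, the same pattern works with an environment variable via env_var(); a sketch, with a hypothetical default of prod_dataset:

```yaml
# sources.yml -- env_var() is resolved at parse time; the second
# argument is the fallback used when the variable is unset
version: 2

sources:
  - name: jaffle_shop
    schema: "{{ env_var('SOURCE_JAFFLE_SHOP_SCHEMA', 'prod_dataset') }}"
    tables:
      - name: orders
```

Then, for example, `SOURCE_JAFFLE_SHOP_SCHEMA=my_staging_dataset dbt run` points the sources at the staging dataset, which fits CI jobs where each environment already exports its own variables.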
I use dbt-snowflake 1.1.0 with the corresponding dbt-core 1.1.0.
I added documentation for my models in yaml files, i.e.:
> models/stage/docs.yml
version: 2

models:
  - name: raw_weblogs
    description: Web logs of customer interaction, broken down by attribute (bronze). The breakdown is performed using regular expressions.
    columns:
      - name: ip
        description: IP address from which the request reached the server (might be direct customer IP or the address of a VPN/proxy).
...
Although these details show up correctly in the dbt docs UI when I run dbt docs generate and then dbt docs serve, they are not listed in target/catalog.json:
cat target/catalog.json | grep identity
(no results)
According to the dbt documentation, I understand that column comments should be part of catalog.json.
LATER EDIT: I tried running dbt --debug docs generate, and it seems that all data is retrieved directly from the target environment (in my case, Snowflake). Looking at the columns of my model in Snowflake, they indeed do NOT have any comments attached to them.
It thus seems to me that the underlying problem might be that dbt run does not persist the column metadata to Snowflake.
After further investigation, I found the reason for the missing comments: dbt docs generate writes catalog.json based on what it receives from the database, while dbt docs serve populates the UI by combining catalog.json with metadata (in my case, the documented column comments) from the local dbt models.
The solution to persist such metadata in the database with dbt run was to add the following DBT configuration:
> dbt_project.yml
models:
  <project>:
    ...
    +persist_docs:
      relation: true
      columns: true
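The same setting can also be applied per-model from a config block instead of project-wide; a sketch (the model path and source reference here are hypothetical):

```sql
-- models/stage/raw_weblogs.sql -- hypothetical model path
{{ config(
    persist_docs={"relation": true, "columns": true}
) }}

select *
from {{ source('weblogs', 'raw') }}  -- hypothetical source
```

This is handy when only a few models need their descriptions written back to the warehouse.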
I am pretty new to dbt. I want to use two kinds of warehouses in one project. Currently I have declared my ClickHouse warehouse, which I am going to build tables in, and I need to add another warehouse, MindsDB, because I want to reference some of the tables in it.
Currently my profiles.yml looks like this:
dbt-project:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
I want to add the warehouse below too:
type: mysql
host: mysql.mindsdb.com
user: mindsdb.user#example.com
password: xxx
port: 3306
dbname: mindsdb
schema: exampl_xxx
threads: 1
Is there a way to do it? Thank you!
This is a bit outside what dbt is designed to do. Is there any reason you can't use multiple projects with their own deployments? Presumably the models have dependencies on each other?
If I had to do this I would:
Create two targets (or sets of targets), one for each warehouse (or dev/prod for each warehouse, etc.)
Create a custom materialization that wraps the typical table materialization, but no-ops if target.type does not match a specified adapter
Run the project on each adapter in series, in a shell script
Use tags to select parts of the DAG that are up/downstream from the other adapter's models
I think the core of the problem is that dbt run needs a database connection to compile your project, so you really can't run against two databases simultaneously. What I described above is not really any better than having two standalone projects. You may want to look into using an orchestrator, like Airflow, Dagster, or Prefect.
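Concretely, the two-target part of that setup could look roughly like this in profiles.yml (connection details are copied from the question; check the dbt-mysql adapter docs for the exact profile keys it expects):

```yaml
dbt-project:
  target: clickhouse_dev
  outputs:
    clickhouse_dev:
      type: clickhouse
      schema: clickhouse_l
      host: xxx
      port: 6000
      user: xxx
      password: xxxx
    mindsdb_dev:
      type: mysql
      host: mysql.mindsdb.com
      port: 3306
      user: xxx
      password: xxx
      dbname: mindsdb
      schema: exampl_xxx
      threads: 1
```

The shell script would then run the project once per target, e.g. `dbt run --target clickhouse_dev` followed by `dbt run --target mindsdb_dev`, with the custom materialization skipping any model whose adapter doesn't match the active target.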
I'm creating a lot of tooling based on the manifest.json that dbt generates for me. But for whatever reason, the "data_type" property for each column is always None in the manifest.json, even though I can see it in catalog.json (I believe the data type there is retrieved from the database).
How do I get the data_type attribute populated in my manifest.json file?
Some helpful answers from this dbt Slack thread:
first reply (h/t Daniel Luftspring)
Not sure if this is the only way, but I'm running dbt version 0.20.1: you can specify data_type as a column property in your schema.yml, and it will show up in the manifest like so:
columns:
  - name: city
    data_type: string
If you have a big project and want to automate this, you could probably pull together a script that edits your schema files in place and syncs the data types with your database via the information schema.
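As a starting point for such a script, the column types can be read out of catalog.json rather than queried from the database directly. A minimal sketch in Python, assuming dbt's documented catalog layout (each node carries a "columns" dict whose values have "name" and "type" keys):

```python
import json

def column_types_from_catalog(catalog_path="target/catalog.json"):
    """Return {node_id: {column_name: data_type}} parsed from a dbt catalog.json."""
    with open(catalog_path) as f:
        catalog = json.load(f)
    types = {}
    for node_id, node in catalog.get("nodes", {}).items():
        # Each column entry records the name and database type of the column
        types[node_id] = {
            col["name"]: col["type"]
            for col in node.get("columns", {}).values()
        }
    return types
```

From there, the script could rewrite each schema.yml, adding a data_type entry per column so the types flow through to the manifest.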
second reply (h/t Jonathon Talmi)
FYI: catalog.json has data types because dbt queries the metadata tables in your DWH (e.g. the information schema in Snowflake) to construct the catalog, but the traditional dbt compile/run/etc., which generates the manifest, does not run such queries.
Could you please help me with this issue?
Encountered an error:
Compilation Error in model metrics_model (models\example\metrics_model.sql)
Model 'model.test_project.metrics_model' (models\example\metrics_model.sql) depends on a source named 'automate.metrics' which was not found
I keep running into this error and have not been able to solve it.
Many thanks beforehand!
This is due to the automate.metrics table missing from the database (either the dbt project's target database or a different database on the same server). There should be a source.yml or automate.yml file somewhere in your project that defines the source. FYI, automate is the source (usually the schema) name and metrics is the table name.
If the source yml file specifies a database for the automate schema, query that database to make sure that the metrics table exists in the automate schema.
If the source yml file doesn’t list a database, then that schema / table should exist in the dbt project’s target database. You can see what the target database is by looking at the profile for your project setup in ~/.dbt/profiles.yml.
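For reference, a minimal source definition for this case might look like the following (the file path and schema value are assumptions to adapt):

```yaml
# models/staging/automate.yml -- hypothetical location; any .yml under models/ works
version: 2

sources:
  - name: automate      # referenced as source('automate', ...)
    schema: automate    # the schema that actually contains the table
    tables:
      - name: metrics
```

The model would then reference it as `{{ source('automate', 'metrics') }}`.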
For PostgreSQL database please check if the sources.yml file is defined as follows:
version: 2

sources:
  - name: name_of_the_source
    schema: name_of_the_schema
    quoting:
      database: false
      schema: false
      identifier: false
    loader: stitch
    tables:
      - name: name_of_table1
      - name: name_of_table2
Are you seeing this in your dev environment? It's possible that you haven't run dbt run after creating automate.metrics, which is preventing metrics_model from referencing it.
Check whether you put the source config in the right YAML file. I encountered this issue and tried every solution, including the one above. It turned out I had forgotten the .yml suffix on the source file, so dbt couldn't locate the source config in it.
I have a Dataflow pipeline that creates an Entity in Google Datastore. I can then search for that entity within the pipeline using a Query, but only if I don't include an Operator.GREATER_THAN_OR_EQUAL Filter on a timestamp property.
When I include that more complex query, I get the error:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: com.google.datastore.v1.client.DatastoreException: no matching index found. recommended index is:
- kind: FileInjectionRecord
properties:
- name: FilePath
- name: LastModifiedDate
, code=FAILED_PRECONDITION
It seems fairly clear that I need to take the recommended index in the error message and add it to an index.yaml file, but all the documentation that I have seen so far about this talks about including the index.yaml file in your WAR file, which I don't have in this case.
Is it possible to define a composite index somewhere in a Dataflow pipeline or do I need to set this up as a one off outside of the pipeline?
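For what it's worth, the recommended index from the error message translates into an index.yaml like the one below, which can be deployed as a one-off with the gcloud CLI rather than bundled into the pipeline:

```yaml
# index.yaml -- composite index copied from the error's recommendation
indexes:
  - kind: FileInjectionRecord
    properties:
      - name: FilePath
      - name: LastModifiedDate
```

Running `gcloud datastore indexes create index.yaml` creates the index outside the Dataflow pipeline; the query should start succeeding once the index finishes building.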