dbt cannot create two resources with identical database representations - dbt

I have a situation here as below:
There are two models in my dbt project
model-A
{{ config(
materialized='ephemeral',
alias='A_0001',
schema=var('xxx_yyy_dataset')
) }}
model-B
{{ config(
materialized='ephemeral',
alias='B_0002',
schema=var('xxx_yyy_dataset')
) }}
Both are consumed by an incremental model that is materialized in the same schema, as xxx_yyy_dataset.Table_DDD:
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
SELECT * FROM {{ref('A_0001')}}
UNION ALL
SELECT * FROM {{ref('B_0002')}}
This is working fine and it is ingesting records into target table.
Now I have introduced another model (model-C) in a different package:
model-C
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
This gives me the following error:
$ dbt compile --profiles-dir=profile --target ide
Running with dbt=0.16.0
Encountered an error:
Compilation Error
dbt found two resources with the database representation "xxx_yyy_dataset.Table_DDD".
dbt cannot create two resources with identical database representations. To fix this,
change the "schema" or "alias" configuration of one of these resources:
- model.eplus_rnc_dbt_project.conrol_outcome_joined (models/controls/payment/fa-join/conrol_outcome_joined.sql)
- model.eplus_rnc_dbt_project.dq_control_outcome_joined (models/controls/dq/dq-join/dq_control_outcome_joined.sql)
I have configured custom schema and alias macros as below:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name }}
{%- endif -%}
{%- endmacro %}
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
{%- if custom_alias_name is none -%}
{{ node.name }}
{%- else -%}
{{ custom_alias_name | trim }}
{%- endif -%}
{%- endmacro %}

dbt is doing its job here!
You have two models that share the exact same configuration — conrol_outcome_joined and dq_control_outcome_joined.
This means that they'll both try to write to the same table: xxx_yyy_dataset.Table_DDD.
dbt is (rightfully) throwing an error here to avoid a problem.
As the error message suggests, you should update one of your models to use a different schema or alias so that it gets represented in your BigQuery project as a separate table.
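To make the collision concrete, here is a minimal Python sketch of the kind of check described above. It is not dbt's actual implementation: the function name find_collisions and the replacement alias Table_DDD_dq are hypothetical, and the database part of the relation name is left out for brevity.

```python
from collections import defaultdict


def find_collisions(models):
    """Return {(schema, alias): [model names]} for every relation claimed more than once."""
    by_relation = defaultdict(list)
    for name, cfg in models.items():
        by_relation[(cfg["schema"], cfg["alias"])].append(name)
    return {rel: names for rel, names in by_relation.items() if len(names) > 1}


# The two models named in the error message, with the config from the question.
models = {
    "conrol_outcome_joined": {"schema": "xxx_yyy_dataset", "alias": "Table_DDD"},
    "dq_control_outcome_joined": {"schema": "xxx_yyy_dataset", "alias": "Table_DDD"},
}
collisions = find_collisions(models)
for (schema, alias), names in collisions.items():
    print(f"{schema}.{alias} is claimed by: {', '.join(names)}")

# Giving one model its own alias (a hypothetical name) removes the collision.
models["dq_control_outcome_joined"]["alias"] = "Table_DDD_dq"
fixed = find_collisions(models)
print(fixed)
```

Once each model resolves to its own schema and alias pair, nothing is reported and the compile error should go away.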

I struggled with the same problem. I wanted a pipeline of tests that all write to a single incremental table, and it triggers the same error message. I am afraid this is not possible with dbt.
To work around it, I created a model (table) for each test, plus a main model that selects and unions the output of all the individual test models. In the main model's post_hook I then drop the individual tables, so I end up with a single final testing table that holds all the information.
It is not what I really wanted, since it is not dynamic: every new test has to be added both to the union in the main model and to the drop statement in the post_hook. On the other hand, if one test breaks it does not break all the others, and no pile of extra tables is left in my database when I start work. You just need to orchestrate it to run at the right time for you.
(Another possible approach is a single model: in the pre_hook you create all the tables you want, since dbt cannot make several models write to the same table; in the main body of the model you select and union the contents of those pre_hook tables; and in the post_hook you drop them again. I have not tested this and am not sure it works, but it would reduce the number of tables written to the database, which is the main drawback of the first approach, even if they only exist for a short time.)

Related

How do I run SQL model in dbt multiple times by looping through variables?

I have a model in dbt (test_model) that accepts a geography variable (zip, state, region) in the configuration. I would like to run the model three times by looping through the variables, each time running it with a different variable.
Here's the catch: I have a macro shown below that appends the variable to the end of the output table name (i.e., running test_model with zip as the variable outputs a table called test_model_zip). This is accomplished by adding {{ config(alias=var('geo')) }} at the top of the model.
Whether I define the variable within dbt_project.yml, the model itself, or on the CLI, I've been unable to find a way to loop through these variables, each time passing the new variable to the configuration, and successfully create three tables. Do any of you have an idea how to accomplish this? FWIW, I'm using BigQuery SQL.
The macro:
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
{%- if custom_alias_name is none -%}
{{ node.name }}
{%- else -%}
{% set node_name = node.name ~ '_' ~ custom_alias_name %}
{{ node_name | trim }}
{%- endif -%}
{%- endmacro %}
The model, run by entering dbt run --select test_model.sql --vars '{"geo": "zip"}' in the CLI:
{{ config(materialized='table', alias=var('geo')) }}
with query as (select 1 as id)
select * from query
The current output: a single table called test_model_zip.
The desired output: three tables called test_model_zip, test_model_state, and test_model_region.
I would flip this on its head.
dbt doesn't really have a concept for parameterized models, so if you materialize a single model in multiple places, you'll lose lineage (the DAG relationship) and docs/etc. will get all confused.
Much better to create multiple model files that simply call a macro with a different parameter, like this:
geo_model_macro.sql
{% macro geo_model_macro(grain) %}
select
{{ grain }},
count(*)
from {{ ref('my_upstream_table') }}
group by 1
{% endmacro %}
test_model_zip.sql
{{ geo_model_macro('zip') }}
test_model_state.sql
{{ geo_model_macro('state') }}
test_model_region.sql
{{ geo_model_macro('region') }}
If I needed to do this hundreds of times (instead of 3), I would either:
Create a script to generate all of these .sql files for me
Create a new materialization that accepted a list of parameters, but this would be a super-advanced, here-be-dragons approach that is probably only appropriate when you've maxed out your other options.
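For the first option, a generator script could look roughly like the sketch below. It assumes the geo_model_macro macro and the test_model_<grain>.sql naming shown above; the function name write_geo_models is hypothetical, and the demo writes into a temporary directory rather than a real dbt project.

```python
from pathlib import Path
import tempfile


def write_geo_models(grains, models_dir, macro_name="geo_model_macro", prefix="test_model"):
    """Write one model file per grain; each file only calls the shared macro."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for grain in grains:
        path = models_dir / f"{prefix}_{grain}.sql"
        # Build the Jinja call by concatenation to avoid escaping braces in an f-string.
        path.write_text("{{ " + macro_name + "('" + grain + "') }}\n")
        written.append(path)
    return written


# Demo: generate the three files from the answer into a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_geo_models(["zip", "state", "region"], tmp)
    names = [p.name for p in paths]
    first_body = paths[0].read_text()

print(names)
print(first_body)
```

In a real project you would point models_dir at a folder under models/ and commit the generated files, so that each one shows up as its own node in the DAG.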

Change materialization name(-prefix) of seed data in the warehouse

Currently the seeds are automatically generated in the warehouse under the name dbt_{schema_name}_seed_data, where {schema_name} is the schema name specified in profiles.yml.
I want to specify a different name, e.g. dbt_processing_seed_data, without changing the schema name in profiles.yml to 'processing'.
The reason behind all this: different devs want their own schema so they don't interfere with each other, but it is unnecessary to store the (same) seed data multiple times in the warehouse.
You can set the schema for a seed in your dbt_project.yml file. See the docs.
To get the behavior you describe, where the target schema is not prepended to the custom schema, you need to override the generate_schema_name macro by creating a new macro with that name in your project. Docs on that are here. You can use the node's resource type so that this behavior is only applied to seeds.
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}{{ default_schema }}
{%- elif node.resource_type == "seed" -%}{{ custom_schema_name | trim }}
{%- else -%}{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
I'd caution against this, though. Seeds are version-controlled, and really aren't intended to be used for large raw datasets (see the docs again). Since they get checked in alongside code, they should really share the same separation of environments that the code has.
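The three branches of the macro above can be checked outside dbt with a plain Python sketch. This only mirrors the Jinja logic; the target schema dbt_alice and the custom schema marts are hypothetical values chosen for illustration.

```python
def generate_schema_name(custom_schema_name, resource_type, target_schema):
    """Python mirror of the seed-aware macro: seeds skip the target-schema prefix."""
    if custom_schema_name is None:
        return target_schema
    if resource_type == "seed":
        return custom_schema_name.strip()
    return f"{target_schema}_{custom_schema_name.strip()}"


# A seed with a custom schema keeps that schema as-is, regardless of the developer's target.
seed_schema = generate_schema_name("dbt_processing_seed_data", "seed", "dbt_alice")
# A model with a custom schema still gets the usual prefix.
model_schema = generate_schema_name("marts", "model", "dbt_alice")
# Anything without a custom schema lands in the target schema.
default_schema = generate_schema_name(None, "model", "dbt_alice")

print(seed_schema, model_schema, default_schema)
```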

dbt Snapshot - if not execute

I have a dbt Snapshot that calls a Macro to get a list of column names back from a database.
It works fine when using
dbt run
When using the snapshot command, it fails because it doesn't run in execute mode.
dbt snapshot
I am currently using if not execute in the Macro which helps for compiling the project.
{%- if not execute -%}
Is there any way to get around this so I could use the snapshot functionality without doing a run operation on all models etc.?
Thanks
edit :
Macro works fine in models when running dbt run.
When placed in snapshots, it does not run in execute mode, so the "Test" values are returned instead of values from a query.
{% macro GetColumnNames(DatabaseName, SchemaName, TableName) %}
{%- if not execute -%}
{{ return(["Test1","Test2"]) }}
{% endif %}
{%- set QueryRetrieveColumnNames -%}
SELECT
...
, COLUMN_NAME ...
FROM ...
{%- endset -%}
{% set Results = run_query(QueryRetrieveColumnNames) %}
{%- set ColumnNames = Results.columns[3].values() -%}
{{ return(ColumnNames) }}
{% endmacro %}
In the snapshot I'm doing other things, but even just the columns on their own won't work
{% snapshot TestSnapshot %}
{% set Relation = source(...) -%}
{% set ColumnNames = GetColumnNames(Relation.database, Relation.schema, Relation.identifier) -%}
SELECT
'a' AS a
{%- for ColumnName in ColumnNames %}
, "{{ ColumnName.column }}"
{%- endfor %}
FROM {{ source(...) }}
{% endsnapshot %}
I've switched from the Macro to use get_columns_in_relation
{% set ColumnNames = adapter.get_columns_in_relation(Relation) -%}
This fails at parsing, yet runs fine in models.
Parsing Error in snapshot ...
at path ['check_cols']: Undefined is not valid under any of the given schemas
I'm not sure of the context of this question (dbt Cloud, CLI, etc.), so this is a ballpark solution.
According to the docs on the snapshot command for the CLI, you should be able to use something like:
dbt snapshot --select column_snapshot
if that's the only thing you want to "snapshot"
Additionally, if this is in dbt Cloud, you could create a model and a job with something like the following (I use this for testing pre-hook and post-hook functionality):
one-model.sql
select 1
(any model with valid sql works)
Then for that cloud job:
dbt seed --full-refresh
dbt run --models one-model --full-refresh
dbt snapshot --select column_snapshot

DBT custom schema using folder structure

Is there a way in dbt to create custom schemas for a model in a derived way, by looking at the folder structure?
For example, say this is my structure:
models
└-- product1
    └-- team1
    |   └-- model1.sql
    └-- team2
        └-- model2.sql
In this case, model1.sql would be created in the schema product1_team1 whereas model2.sql would be created in the schema product1_team2. I guess I can specify those "by hand" in the dbt_project.yml file, but I was wondering if there was a way to do this in an automated way - so that every new model or folder is automatically created in the right schema.
I was looking at custom schema macros (https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas) but it seems to be plain jinja or simple Python built-ins. Not sure how I would be able to access folder paths in those macros.
Also, is there a way to write a macro in Python? It would be relatively straightforward with the file path and the os module.
You can achieve that using only Jinja functions and dbt context variables.
As you have noticed, we can overwrite the dbt built-in macro that handles the schema's name, and luckily, there's a way to access the model's path using the node variable that is defined in the arguments of the macro.
I used the fqn property for this example:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{# Check if the model does not contain a subfolder (e.g, models created at the MODELS root folder) #}
{% if node.fqn[1:-1]|length == 0 %}
{{ default_schema }}
{% else %}
{# Concat the subfolder(s) name #}
{% set prefix = node.fqn[1:-1]|join('_') %}
{{ prefix | trim }}
{% endif %}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
The fqn property will return a list based on the location of your model where the first position will be the dbt project name and the last position will be your model's name. So based on your example, we'd have the following:
[<project_name>, 'product1', 'team1', 'model1']
If you run dbt ls -m <model_name>, you'll notice that the output matches what fqn returns.
node.fqn[1:-1] uses Python-style list slicing: it drops the first and last elements of the list (the project name and the model name), leaving only the folder path of your model.
With that in mind, the macro first checks whether the model sits in a subfolder. If it does not, we return just the default_schema defined in profiles.yml. Otherwise, we turn the list into a string using the Jinja join filter.
If you want to explore further, log the node variable to see all the properties it exposes.
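Because the slice works the same way in Python, the schema derivation can be tried out without dbt. The sketch below mirrors the macro's logic under that assumption; the function name schema_from_fqn and the values my_project, model0 and dbt_alice are hypothetical.

```python
def schema_from_fqn(fqn, target_schema, custom_schema_name=None):
    """Python mirror of the macro: derive the schema from the folders in fqn."""
    if custom_schema_name is not None:
        return f"{target_schema}_{custom_schema_name.strip()}"
    # Drop the project name (first element) and the model name (last element).
    folders = fqn[1:-1]
    if not folders:
        # Model lives directly in the models root folder.
        return target_schema
    return "_".join(folders)


nested = schema_from_fqn(["my_project", "product1", "team1", "model1"], "dbt_alice")
other = schema_from_fqn(["my_project", "product1", "team2", "model2"], "dbt_alice")
root = schema_from_fqn(["my_project", "model0"], "dbt_alice")

print(nested, other, root)
```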

dbt: How Can I Write Source Tables Into Their Own Schema Without Production Schema Prefix?

I'm trying to follow Gitlab's folder and dbt structure. Specifically for sources they've got a separate schema for each of their source tables. My production schema is called analytics and my production database is called analytics. When I run this in production dbt will create analytics.analytics_sfdc instead of analytics.sfdc. How can I set this up so that the source tables are written to analytics.sfdc?
Thanks!
The schema prefix/suffix setup is default in dbt. You can override it by changing the generate_schema_name macro in your project, as outlined here.
This is the code for the default version of the macro:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
You can see the prefix logic in there. To override it, you simply need to create a new version of the macro in your project.
Assuming your production environment is a target called 'prod', it can be as simple as adding this:
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
As per the docs, this will behave as follows, which appears to be what you want:
In prod:
If a custom schema is provided, a model's schema name should match the custom schema, rather than being concatenated to the target schema.
If no custom schema is provided, a model's schema name should match the target schema.
In other environments (e.g. dev or qa):
Build all models in the target schema, as in, ignore custom schema configurations.
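The rules listed above can be written down as a small Python sketch to sanity-check the outcome for each environment. It models the documented behavior only, not dbt's source; the dev target schema dbt_alice is a hypothetical value, while analytics and sfdc come from the question.

```python
def generate_schema_name_for_env(custom_schema_name, target_name, target_schema):
    """Python model of the rules: custom schema wins in prod, target schema everywhere else."""
    if target_name == "prod" and custom_schema_name is not None:
        return custom_schema_name.strip()
    return target_schema


# In prod, the custom schema is used on its own, with no target-schema prefix.
prod_custom = generate_schema_name_for_env("sfdc", "prod", "analytics")
# In prod without a custom schema, the target schema is used.
prod_default = generate_schema_name_for_env(None, "prod", "analytics")
# In any other target, the custom schema is ignored.
dev_custom = generate_schema_name_for_env("sfdc", "dev", "dbt_alice")

print(prod_custom, prod_default, dev_custom)
```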
Alternatively, you can alter the logic of the first code snippet to do something more custom to your specific setup.
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
I use the above as my macros/generate_schema_name.sql to stop dbt from prefixing schema names, irrespective of the environment.