DBT: conditionally set schema config - dbt

I'm trying to work out how to conditionally set schema config attributes. I've tried calling a macro in both dbt_project.yml and schema.yml, but both attempts fail with:
00:23:19 Encountered an error:
Compilation Error
Could not render {{get_location_root('lndv')}}: 'get_location_root' is undefined
The outcome I would like to achieve is conditionally setting location_root for Spark for various schemas. I want different locations for each environment. I thought the macro path was the best fit as this follows a pattern but it obviously doesn't work in dbt_project.yml or property files. I was using target.name to determine environment. It's in the same directory as other macros that are successfully rendering in models so the path is set correctly. I don't really want to resort to placing this config in each model if I can avoid it.
Does anyone have any thoughts on how I can solve this? Either getting the macro to work in dbt_project.yml / schema.yml or by some other method?
Regards,
Ashley

dbt only allows a small subset of Jinja in .yml files. In particular, you can't call macros defined in your project. But you can use simple conditionals. Jinja that appears in .yml files must be quoted:
schema: "{{ 'prod_schema' if target.name == 'production' else 'dev_schema' }}"
Another option for you is to override the built-in macro that generates schema names. There is a great write-up in the dbt docs on this topic.
From the docs:
If your dbt project includes a macro that is also named generate_schema_name, dbt will always use the macro in your dbt project instead of the default macro.
Therefore, to change the way dbt generates a schema name, you should add a macro named generate_schema_name to your project, where you can then define your own logic.
There is even an alternative "non-default" version of this macro that ships with dbt, called generate_schema_name_for_env, with the logic:
In prod:
If a custom schema is provided, a model's schema name should match the custom schema, rather than being concatenated to the target schema.
If no custom schema is provided, a model's schema name should match the target schema.
In other environments (e.g. dev or qa):
Build all models in the target schema, as in, ignore custom schema configurations.
To use generate_schema_name_for_env, you create a new macro in your project with the following contents:
-- put this in macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
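The rules quoted above can be traced with a small plain-Python sketch. This is only a restatement of the quoted logic, not dbt's own code; the function name, the example schema names, and the assumption that the production target is called prod are all illustrative.

```python
from typing import Optional


def schema_name_for_env(
    target_name: str,
    target_schema: str,
    custom_schema_name: Optional[str],
) -> str:
    """Restate the quoted rules as a plain function (hypothetical helper)."""
    # Assumption: the production target is named "prod".
    # In prod, a custom schema (when provided) is used as-is.
    if target_name == "prod" and custom_schema_name is not None:
        return custom_schema_name.strip()
    # In prod without a custom schema, and in every other environment,
    # the model lands in the target schema.
    return target_schema


# Hypothetical schema names, for illustration only.
print(schema_name_for_env("prod", "analytics", "marketing"))
print(schema_name_for_env("prod", "analytics", None))
print(schema_name_for_env("dev", "dbt_ashley", "marketing"))
```

With these inputs, only the first call returns the custom schema; the other two fall back to the target schema.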
EDIT: In Spark, you can use a similar trick to set the "location" of the materialized model by overriding the location_clause macro (which is part of the dbt-spark adapter). Your macro should template to a string with the word "location" followed by a path wrapped in single quotes:
{% macro location_clause() %}
{%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
{%- set identifier = model['alias'] -%}
{%- if location_root is not none and target.name == "production" %}
location '{{ location_root }}/prod/{{ identifier }}'
{%- elif location_root is not none %}
location '{{ location_root }}/dev/{{ identifier }}'
{%- endif %}
{%- endmacro -%}
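To see what string the macro above should template to, here is a minimal Python sketch of the same branching. It is not part of dbt-spark; the function signature and the example root path and identifier are made up for illustration.

```python
from typing import Optional


def location_clause(
    location_root: Optional[str],
    target_name: str,
    identifier: str,
) -> str:
    """Return the text the macro above would emit (hypothetical helper)."""
    # No location_root configured: the macro emits nothing.
    if location_root is None:
        return ""
    # Same split as the macro: "production" goes to prod, anything else to dev.
    env_dir = "prod" if target_name == "production" else "dev"
    return f"location '{location_root}/{env_dir}/{identifier}'"


# Hypothetical root path and model alias.
print(location_clause("/mnt/lake", "production", "lndv_customers"))
print(location_clause("/mnt/lake", "dev", "lndv_customers"))
```

The path is wrapped in single quotes, as the answer notes, because the rendered text is placed directly into the create-table statement.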


How do I run SQL model in dbt multiple times by looping through variables?

I have a model in dbt (test_model) that accepts a geography variable (zip, state, region) in the configuration. I would like to run the model three times by looping through the variables, each time running it with a different variable.
Here's the catch: I have a macro shown below that appends the variable to the end of the output table name (i.e., running test_model with zip as the variable outputs a table called test_model_zip). This is accomplished by adding {{ config(alias=var('geo')) }} at the top of the model.
Whether I define the variable within dbt_project.yml, the model itself, or on the CLI, I've been unable to find a way to loop through these variables, each time passing the new variable to the configuration, and successfully create three tables. Do any of you have an idea how to accomplish this? FWIW, I'm using BigQuery SQL.
The macro:
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
{%- if custom_alias_name is none -%}
{{ node.name }}
{%- else -%}
{% set node_name = node.name ~ '_' ~ custom_alias_name %}
{{ node_name | trim }}
{%- endif -%}
{%- endmacro %}
The model, run by entering dbt run --select test_model.sql --vars '{"geo": "zip"}' in the CLI:
{{ config(materialized='table', alias=var('geo')) }}
with query as (select 1 as id)
select * from query
The current output: a single table called test_model_zip.
The desired output: three tables called test_model_zip, test_model_state, and test_model_region.
I would flip this on its head.
dbt doesn't really have a concept for parameterized models, so if you materialize a single model in multiple places, you'll lose lineage (the DAG relationship) and docs/etc. will get all confused.
Much better to create multiple model files that simply call a macro with a different parameter, like this:
geo_model_macro.sql
{% macro geo_model_macro(grain) %}
select
{{ grain }},
count(*)
from {{ ref('my_upstream_table') }}
group by 1
{% endmacro %}
test_model_zip.sql
{{ geo_model_macro('zip') }}
test_model_state.sql
{{ geo_model_macro('state') }}
test_model_region.sql
{{ geo_model_macro('region') }}
If I needed to do this hundreds of times (instead of 3), I would either:
Create a script to generate all of these .sql files for me
Create a new materialization that accepted a list of parameters, but this would be a super-advanced, here-be-dragons approach that is probably only appropriate when you've maxed out your other options.
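The first option, a script that generates the model files, could look roughly like the sketch below. It assumes the geo_model_macro macro shown earlier already exists in the project; the function name write_geo_models and the output directory are hypothetical, and the demo writes into a temporary directory rather than a real dbt project.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_geo_models(models_dir: Path, grains: Iterable[str]) -> List[Path]:
    """Write one test_model_<grain>.sql file per grain and return the paths."""
    models_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for grain in grains:
        path = models_dir / f"test_model_{grain}.sql"
        # The Jinja braces are built by concatenation so they stay literal.
        path.write_text("{{ geo_model_macro('" + grain + "') }}\n", encoding="utf-8")
        written.append(path)
    return written


# Demo in a throwaway directory; point models_dir at your project instead.
with tempfile.TemporaryDirectory() as tmp:
    for path in write_geo_models(Path(tmp) / "models", ["zip", "state", "region"]):
        print(path.name, "->", path.read_text(encoding="utf-8").strip())
```

Because each generated file is an ordinary model, dbt still sees three separate nodes and keeps lineage and docs intact.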

Change materialization name(-prefix) of seed data in the warehouse

Currently the seeds are automatically generated in the warehouse with the name dbt_{schema_name}_seed_data, where {schema_name} is the schema name specified in profiles.yml.
I want to specify a different name, e.g. dbt_processing_seed_data, without changing the schema name in profiles.yml to 'processing'.
The reason behind all this: different devs want their own schemas so they don't interfere with each other, but it is unnecessary to store the (same) seed data multiple times in the warehouse.
You can set the schema for a seed in your dbt_project.yml file. See the docs.
To get the behavior you describe, where the target schema is not prepended to the custom schema, you need to override the generate_schema_name macro by creating a new macro with that name in your project. Docs on that are here. You can use the node's resource type so that this behavior is only applied to seeds.
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}{{ default_schema }}
{%- elif node.resource_type == "seed" -%}{{ custom_schema_name | trim }}
{%- else -%}{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
I'd caution against this, though. Seeds are version-controlled, and really aren't intended to be used for large raw datasets (see the docs again). Since they get checked in alongside code, they should really share the same separation of environments that the code has.
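For reference, the branching in the seed-specific macro above can be traced in plain Python. This is a sketch of the same three cases, not dbt code; the target schema dbt_alice and the custom schema processing are invented for the example.

```python
from typing import Optional


def generate_schema_name(
    target_schema: str,
    custom_schema_name: Optional[str],
    resource_type: str,
) -> str:
    """Mirror the three branches of the macro above (illustrative only)."""
    # No custom schema: everything goes to the target schema.
    if custom_schema_name is None:
        return target_schema
    # Seeds use the custom schema as-is, with no target schema prefix.
    if resource_type == "seed":
        return custom_schema_name.strip()
    # All other resources keep the default <target>_<custom> naming.
    return f"{target_schema}_{custom_schema_name.strip()}"


# Hypothetical developer schema and custom schema.
print(generate_schema_name("dbt_alice", "processing", "seed"))
print(generate_schema_name("dbt_alice", "processing", "model"))
print(generate_schema_name("dbt_alice", None, "seed"))
```

Every developer's seeds with the same custom schema end up in one shared schema, which is exactly the sharing across environments that the caution above is about.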

DBT custom schema using folder structure

Is there a way in dbt to derive a model's custom schema from the folder structure?
For example, say this is my structure:
models
└-- product1
    |-- team1
    |   └-- model1.sql
    └-- team2
        └-- model2.sql
In this case, model1.sql would be created in the schema product1_team1 whereas model2.sql would be created in the schema product1_team2. I guess I can specify those "by hand" in the dbt_project.yml file, but I was wondering if there was a way to do this in an automated way - so that every new model or folder is automatically created in the right schema.
I was looking at custom schema macros (https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas) but it seems to be plain jinja or simple Python built-ins. Not sure how I would be able to access folder paths in those macros.
Also, is there a way to write a macro in Python? With the file path and the os module, this would be relatively straightforward.
You can achieve that using only Jinja functions and dbt context variables.
As you have noticed, we can override the dbt built-in macro that generates the schema name, and luckily, the model's path is available through the node variable that the macro receives as an argument.
I used the fqn property for this example:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{# Check if the model does not contain a subfolder (e.g, models created at the MODELS root folder) #}
{% if node.fqn[1:-1]|length == 0 %}
{{ default_schema }}
{% else %}
{# Concat the subfolder(s) name #}
{% set prefix = node.fqn[1:-1]|join('_') %}
{{ prefix | trim }}
{% endif %}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
The fqn property will return a list based on the location of your model where the first position will be the dbt project name and the last position will be your model's name. So based on your example, we'd have the following:
[<project_name>, 'product1', 'team1', 'model1']
If you run dbt ls --select <model_name>, you'll notice that the output is the same fqn, joined with dots.
node.fqn[1:-1] is a Python-style list slice, which Jinja also supports. It drops the first and last items of the list (the project name and the model name), leaving only the folder path of your model.
With that in mind, we have a condition to check if the model doesn't contain a subfolder, because if that's the case, we'll return just the default_schema defined in the profiles.yml. Otherwise, we proceed with the logic to transform the list into a string by using the join Jinja function.
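The slicing and joining described above behave the same way in plain Python, so the macro's logic can be checked outside dbt. The sketch below is only an illustration; the function name, the project name my_project, and the default schema analytics are assumptions.

```python
from typing import List, Optional


def schema_from_fqn(
    fqn: List[str],
    default_schema: str,
    custom_schema_name: Optional[str] = None,
) -> str:
    """Mirror the folder-based macro above in plain Python (illustrative)."""
    # An explicit custom schema keeps the default <target>_<custom> naming.
    if custom_schema_name is not None:
        return f"{default_schema}_{custom_schema_name.strip()}"
    # Drop the project name (first item) and the model name (last item).
    folders = fqn[1:-1]
    # Models directly under the models root have no subfolders.
    if not folders:
        return default_schema
    return "_".join(folders)


# Hypothetical project name and default schema.
print(schema_from_fqn(["my_project", "product1", "team1", "model1"], "analytics"))
print(schema_from_fqn(["my_project", "model1"], "analytics"))
```

For the folder layout in the question, the first call gives product1_team1, and a model placed directly under models falls back to the default schema.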
It's also worth logging the node variable to see everything else it exposes.

dbt cannot create two resources with identical database representations

I have a situation here as below:
There are two models in my dbt project
model-A
{{ config(
materialized='ephemeral',
alias='A_0001',
schema=var('xxx_yyy_dataset')
) }}
model-B
{{ config(
materialized='ephemeral',
alias='B_0002',
schema=var('xxx_yyy_dataset')
) }}
And these are getting materialized as incremental in same schema as xxx_yyy_dataset.Table_DDD
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
SELECT * FROM {{ref('A_0001')}}
UNION ALL
SELECT * FROM {{ref('B_0002')}}
This is working fine and it is ingesting records into target table.
Now I have introduced another model, model-C, in a different package
model-C
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
This gives me the following error:
$ dbt compile --profiles-dir=profile --target ide
Running with dbt=0.16.0
Encountered an error:
Compilation Error
dbt found two resources with the database representation "xxx_yyy_dataset.Table_DDD".
dbt cannot create two resources with identical database representations. To fix this,
change the "schema" or "alias" configuration of one of these resources:
- model.eplus_rnc_dbt_project.conrol_outcome_joined (models/controls/payment/fa-join/conrol_outcome_joined.sql)
- model.eplus_rnc_dbt_project.dq_control_outcome_joined (models/controls/dq/dq-join/dq_control_outcome_joined.sql)
I have configured the custom schema and alias macros as below:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name }}
{%- endif -%}
{%- endmacro %}
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
{%- if custom_alias_name is none -%}
{{ node.name }}
{%- else -%}
{{ custom_alias_name | trim }}
{%- endif -%}
{%- endmacro %}
dbt is doing its job here!
You have two models that share the exact same configuration — conrol_outcome_joined and dq_control_outcome_joined.
This means that they'll both try to write to the same table: xxx_yyy_dataset.Table_DDD.
dbt is (rightfully) throwing an error here to avoid a problem.
As the error message suggests, you should update one of your models to use a different schema or alias so that it gets represented in your BigQuery project as a separate table.
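The kind of check behind this error can be sketched in a few lines of Python: group the resources by their (schema, alias) pair and report any pair claimed more than once. This is not dbt's implementation; the dictionary layout and the function name are assumptions, and the model names are taken from the error message in the question.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_collisions(nodes: List[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Group nodes by (schema, alias) and keep only pairs used more than once."""
    by_relation: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for node in nodes:
        by_relation[(node["schema"], node["alias"])].append(node["name"])
    return {rel: names for rel, names in by_relation.items() if len(names) > 1}


# Assumed shape: one dict per materialized model.
nodes = [
    {"name": "conrol_outcome_joined", "schema": "xxx_yyy_dataset", "alias": "Table_DDD"},
    {"name": "dq_control_outcome_joined", "schema": "xxx_yyy_dataset", "alias": "Table_DDD"},
]
for (schema, alias), names in find_collisions(nodes).items():
    print(f"{schema}.{alias} is claimed by: {', '.join(names)}")
```

Giving one of the two models a different alias or schema makes the pairs unique, and the collision disappears.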
I struggled with the same problem: I wanted a pipeline of tests that all write to a single incremental table, and it triggers the same error message. As far as I can tell, this is not possible with dbt.
To work around it, I created a main model that selects and unions the rows from all the individual test models (I had previously created a model/table for each test). A post_hook on the main model then drops the individual tables, so I am left with a single final testing table that holds all the information.
It is not what I really wanted, since it is not dynamic: every new test has to be added both to the union in the main model and to the drop statement in the post_hook. On the other hand, if one test breaks it does not break the others, and no leftover tables sit in my database when I start work. You just need to orchestrate it at the right time for you.
(Another possible approach is a single model whose pre_hook creates all the tables you want, since dbt cannot make several models write to the same table. The "main" part of the model selects and unions the rows from the pre_hook tables, and the post_hook deletes those tables again. I have not tested this and am not sure it works, but it would reduce the number of tables written to the database, which is the main drawback of the first approach, even if they only exist for a short time.)

Jinja / Django for loop range not working

I'm building a django template to duplicate images based on an argument passed from the view; the template then uses Jinja2 in a for loop to duplicate the image.
BUT, I can only get this to work by passing a list I make in the view. If I try to use the jinja range, I get an error ("Could not parse the remainder: ...").
Reading this link, I swear I'm using the right syntax.
template
{% for i in range(variable) %}
<img src=...>
{% endfor %}
I checked the variable I was passing in; it's type int. Heck, I even tried to get rid of the variable (for testing) and tried using a hard-coded number:
{% for i in range(5) %}
<img src=...>
{% endfor %}
I get the following error:
Could not parse the remainder: '(5)' from 'range(5)'
If I pass to the template a list in the arguments dictionary (and use the list in place of the range statement), it works; the image is repeated however many times I want.
What am I missing? The docs on Jinja (for loop and range) and the previous link all tell me that this should work with range and a variable.
Soooo.... based on Franndy's comment that this isn't automatically supported by Django, and following their link, which leads to this link, I found how to write your own filter.
Inside views.py:
from django.template.defaulttags import register

@register.filter
def get_range(value):
    # Return a range that the template's {% for %} tag can iterate over.
    return range(value)
Then, inside template:
{% for i in variable|get_range %}
<img src=...>
{% endfor %}