DBT custom schema using folder structure - dbt

Is there a way in dbt to derive a model's custom schema from the folder structure?
For example, say this is my structure:
models
└-- product1
    └-- team1
    |   └-- model1.sql
    └-- team2
        └-- model2.sql
In this case, model1.sql would be created in the schema product1_team1 whereas model2.sql would be created in the schema product1_team2. I guess I can specify those "by hand" in the dbt_project.yml file, but I was wondering if there was a way to do this in an automated way - so that every new model or folder is automatically created in the right schema.
I was looking at custom schema macros (https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas) but it seems to be plain jinja or simple Python built-ins. Not sure how I would be able to access folder paths in those macros.
Also, is there a way to write a macro in Python? That would be relatively straightforward given the file path and the os module.

You can achieve that using only Jinja functions and dbt context variables.
As you have noticed, we can override the built-in dbt macro that generates the schema name. Luckily, the model's path is available through the node argument that is passed to the macro.
I used the fqn property for this example:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {# Models placed directly in the models root folder have no subfolder #}
        {%- if node.fqn[1:-1] | length == 0 -%}
            {{ default_schema }}
        {%- else -%}
            {# Join the subfolder name(s) with underscores #}
            {%- set prefix = node.fqn[1:-1] | join('_') -%}
            {{ prefix | trim }}
        {%- endif -%}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
The fqn property will return a list based on the location of your model where the first position will be the dbt project name and the last position will be your model's name. So based on your example, we'd have the following:
[<project_name>, 'product1', 'team1', 'model1']
If you run dbt ls --select <model_name>, you'll notice that the output is the same path, with the elements joined by dots.
The expression node.fqn[1:-1] uses Python-style list slicing, which Jinja supports. It drops the first and last elements of the list (the project name and the model name), leaving only the folder path of your model.
With that in mind, the macro first checks whether the model sits in a subfolder. If it does not, the macro returns the default schema (target.schema, which comes from the active target in profiles.yml). Otherwise, it turns the list into a string with Jinja's join filter.
It is also worth logging the node variable, for example with {{ log(node, info=True) }}, to see everything else it exposes.
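To make the slicing concrete, here is a minimal Python sketch that mirrors the macro's branching outside of dbt. The function name schema_from_fqn and the sample values my_project and analytics are hypothetical; in dbt the same logic runs in Jinja against node.fqn and target.schema.

```python
from typing import Optional, Sequence


def schema_from_fqn(
    fqn: Sequence[str],
    default_schema: str,
    custom_schema_name: Optional[str] = None,
) -> str:
    """Illustrative mirror of the generate_schema_name macro shown above."""
    # An explicit custom schema keeps the usual <target schema>_<custom> form.
    if custom_schema_name is not None:
        return f"{default_schema}_{custom_schema_name.strip()}"
    # Drop the project name (first) and the model name (last).
    folders = list(fqn[1:-1])
    # Models at the root of the models folder fall back to the target schema.
    if not folders:
        return default_schema
    return "_".join(folders)


# Hypothetical project name and target schema, for illustration only.
print(schema_from_fqn(["my_project", "product1", "team1", "model1"], "analytics"))
# product1_team1
print(schema_from_fqn(["my_project", "model0"], "analytics"))
# analytics
```

Note that with this logic the folder-derived schema is not prefixed with the target schema, so every developer would write to the same product1_team1 schema unless the macro is adjusted.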

Related

DBT: conditionally set schema config

I'm trying to determine how I can conditionally set schema config attributes. I've attempted this by a macro in both dbt_project.yml and also in schema.yml but both of these methods fail with:
00:23:19 Encountered an error:
Compilation Error
Could not render {{get_location_root('lndv')}}: 'get_location_root' is undefined
The outcome I would like to achieve is conditionally setting location_root for Spark for various schemas. I want different locations for each environment. I thought the macro path was the best fit as this follows a pattern but it obviously doesn't work in dbt_project.yml or property files. I was using target.name to determine environment. It's in the same directory as other macros that are successfully rendering in models so the path is set correctly. I don't really want to resort to placing this config in each model if I can avoid it.
Does anyone have any thoughts on how I can solve this? Either getting the macro to work in dbt_project.yml / schema.yml or by some other method?
Regards,
Ashley
dbt only allows a small subset of jinja in .yml files. In particular, you can't use macros. But you can use simple conditionals. Jinja that appears in .yml files must be quoted:
schema: "{{ 'prod_schema' if target.name == 'production' else 'dev_schema' }}"
Another option for you is to override the built-in macro that generates schema names. There is a great write-up in the dbt docs on this topic.
From the docs:
If your dbt project includes a macro that is also named generate_schema_name, dbt will always use the macro in your dbt project instead of the default macro.
Therefore, to change the way dbt generates a schema name, you should add a macro named generate_schema_name to your project, where you can then define your own logic.
There is even an alternative "non-default" version of this macro that ships with dbt, called generate_schema_name_for_env, with the logic:
In prod:
If a custom schema is provided, a model's schema name should match the custom schema, rather than being concatenated to the target schema.
If no custom schema is provided, a model's schema name should match the target schema.
In other environments (e.g. dev or qa):
Build all models in the target schema, as in, ignore custom schema configurations.
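The rules above can be paraphrased in a few lines of Python. This is only a sketch of the documented behaviour, not dbt's source code; the function name schema_name_for_env is hypothetical, and it assumes the production target is literally named prod.

```python
from typing import Optional


def schema_name_for_env(
    target_name: str,
    target_schema: str,
    custom_schema_name: Optional[str] = None,
) -> str:
    """Sketch of the documented generate_schema_name_for_env rules."""
    # In prod, a custom schema replaces the target schema entirely.
    if target_name == "prod" and custom_schema_name is not None:
        return custom_schema_name.strip()
    # Everywhere else (and in prod without a custom schema), use the target schema.
    return target_schema


print(schema_name_for_env("prod", "analytics", "marketing"))  # marketing
print(schema_name_for_env("dev", "dbt_alice", "marketing"))   # dbt_alice
```

If your production target has a different name (for example production), the comparison would need to change accordingly in your own macro.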
To use generate_schema_name_for_env, you create a new macro in your project with the following contents:
-- put this in macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
EDIT: In Spark, you can use a similar trick to set the "location" of the materialized model by overriding the location_clause macro (which is part of the dbt-spark adapter). Your macro should template to a string with the word "location" followed by a path wrapped in single quotes:
{% macro location_clause() %}
    {%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
    {%- set identifier = model['alias'] -%}
    {%- if location_root is not none and target.name == "production" %}
        location '{{ location_root }}/prod/{{ identifier }}'
    {%- elif location_root is not none %}
        location '{{ location_root }}/dev/{{ identifier }}'
    {%- endif %}
{%- endmacro -%}
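As a rough illustration of what the macro above renders, the same branching can be written in Python. The function name and the sample path /mnt/lake are hypothetical; the real macro reads location_root from the model config and the identifier from the model's alias.

```python
from typing import Optional


def location_clause(location_root: Optional[str], target_name: str, identifier: str) -> str:
    """Sketch of the string the overridden location_clause macro would emit."""
    # Without a configured location_root, no clause is emitted.
    if location_root is None:
        return ""
    # Only the target named "production" writes under /prod; all others use /dev.
    env = "prod" if target_name == "production" else "dev"
    return f"location '{location_root}/{env}/{identifier}'"


print(location_clause("/mnt/lake", "production", "orders"))
# location '/mnt/lake/prod/orders'
print(location_clause("/mnt/lake", "ci", "orders"))
# location '/mnt/lake/dev/orders'
```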

How do I run SQL model in dbt multiple times by looping through variables?

I have a model in dbt (test_model) that accepts a geography variable (zip, state, region) in the configuration. I would like to run the model three times by looping through the variables, each time running it with a different variable.
Here's the catch: I have a macro shown below that appends the variable to the end of the output table name (i.e., running test_model with zip as the variable outputs a table called test_model_zip). This is accomplished by adding {{ config(alias=var('geo')) }} at the top of the model.
Whether I define the variable within dbt_project.yml, the model itself, or on the CLI, I've been unable to find a way to loop through these variables, each time passing the new variable to the configuration, and successfully create three tables. Do any of you have an idea how to accomplish this? FWIW, I'm using BigQuery SQL.
The macro:
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if custom_alias_name is none -%}
        {{ node.name }}
    {%- else -%}
        {% set node_name = node.name ~ '_' ~ custom_alias_name %}
        {{ node_name | trim }}
    {%- endif -%}
{%- endmacro %}
The model, run by entering dbt run --select test_model.sql --vars '{"geo": "zip"}' in the CLI:
{{ config(materialized='table', alias=var('geo')) }}
with query as (select 1 as id)
select * from query
The current output: a single table called test_model_zip.
The desired output: three tables called test_model_zip, test_model_state, and test_model_region.
I would flip this on its head.
dbt doesn't really have a concept for parameterized models, so if you materialize a single model in multiple places, you'll lose lineage (the DAG relationship) and docs/etc. will get all confused.
Much better to create multiple model files that simply call a macro with a different parameter, like this:
macros/geo_model_macro.sql:
{% macro geo_model_macro(grain) %}
    select
        {{ grain }},
        count(*)
    from {{ ref('my_upstream_table') }}
    group by 1
{% endmacro %}
models/test_model_zip.sql:
{{ geo_model_macro('zip') }}
models/test_model_state.sql:
{{ geo_model_macro('state') }}
models/test_model_region.sql:
{{ geo_model_macro('region') }}
If I needed to do this hundreds of times (instead of 3), I would either:
Create a script to generate all of these .sql files for me
Create a new materialization that accepted a list of parameters, but this would be a super-advanced, here-be-dragons approach that is probably only appropriate when you've maxed out your other options.
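For the first option, a file-generating script can be very small. The sketch below is one possible shape, not an established dbt tool: the function name write_geo_models and the output folder models/geo are assumptions, and it relies on the geo_model_macro macro defined earlier.

```python
from pathlib import Path
from typing import Iterable, List


def write_geo_models(grains: Iterable[str], models_dir: Path) -> List[Path]:
    """Write one thin model file per grain, each calling geo_model_macro."""
    models_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for grain in grains:
        path = models_dir / f"test_model_{grain}.sql"
        # Plain concatenation avoids escaping Jinja's double braces in an f-string.
        path.write_text("{{ geo_model_macro('" + grain + "') }}\n")
        written.append(path)
    return written


# Example call (hypothetical output folder inside the dbt project):
# write_geo_models(["zip", "state", "region"], Path("models/geo"))
```

Because each generated file is an ordinary model, dbt still tracks lineage and docs for every table separately.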

Change materialization name(-prefix) of seed data in the warehouse

Currently the seeds are automatically created in the warehouse under the name dbt_{schema_name}_seed_data, with {schema_name} being the schema name specified in profiles.yml.
I want to specify a different name, e.g. dbt_processing_seed_data, without changing the schema name in profiles.yml to 'processing'.
The reason behind all this: different devs want to have their own schema so they don't interfere with each other, but it is unnecessary to store the (same) seed data multiple times in the warehouse.
You can set the schema for a seed in your dbt_project.yml file. See the docs.
To get the behavior you describe, where the target schema is not prepended to the custom schema, you need to override the generate_schema_name macro by creating a new macro with that name in your project. Docs on that are here. You can use the node's resource type so that this behavior is only applied to seeds.
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}{{ default_schema }}
{%- elif node.resource_type == "seed" -%}{{ custom_schema_name | trim }}
{%- else -%}{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
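To see how the three branches of this macro play out, here is a small Python mirror of the same logic. The function name and the developer schema dbt_alice are hypothetical examples; the seed schema name is the one from the question.

```python
from typing import Optional


def schema_name_with_seed_rule(
    target_schema: str,
    custom_schema_name: Optional[str],
    resource_type: str,
) -> str:
    """Sketch of the macro above: seeds use the custom schema as-is."""
    if custom_schema_name is None:
        return target_schema
    if resource_type == "seed":
        # Seeds skip the target-schema prefix, so all developers share one schema.
        return custom_schema_name.strip()
    return f"{target_schema}_{custom_schema_name.strip()}"


print(schema_name_with_seed_rule("dbt_alice", "dbt_processing_seed_data", "seed"))
# dbt_processing_seed_data
print(schema_name_with_seed_rule("dbt_alice", "staging", "model"))
# dbt_alice_staging
```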
I'd caution against this, though. Seeds are version-controlled, and really aren't intended to be used for large raw datasets (see the docs again). Since they get checked in alongside code, they should really share the same separation of environments that the code has.

dbt: How Can I Write Source Tables Into Their Own Schema Without Production Schema Prefix?

I'm trying to follow Gitlab's folder and dbt structure. Specifically for sources they've got a separate schema for each of their source tables. My production schema is called analytics and my production database is called analytics. When I run this in production dbt will create analytics.analytics_sfdc instead of analytics.sfdc. How can I set this up so that the source tables are written to analytics.sfdc?
Thanks!
The schema prefix/suffix setup is default in dbt. You can override it by changing the generate_schema_name macro in your project, as outlined here.
This is the code for the default version of the macro:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
You can see the prefix logic in there. To override it, you simply need to create a new version of the macro in your project.
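Applied to the names in the question, that default logic explains the analytics_sfdc result. The following Python lines are only a restatement of the macro for illustration; the function name is hypothetical.

```python
from typing import Optional


def default_schema_name(target_schema: str, custom_schema_name: Optional[str] = None) -> str:
    """Python restatement of the default generate_schema_name macro."""
    if custom_schema_name is None:
        return target_schema
    # The target schema is always prepended to a custom schema.
    return f"{target_schema}_{custom_schema_name.strip()}"


print(default_schema_name("analytics", "sfdc"))  # analytics_sfdc
print(default_schema_name("analytics"))          # analytics
```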
Assuming your production environment is a target called 'prod', it can be as simple as adding this:
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
As per the docs, this will behave as follows, which appears to be what you want:
In prod:
If a custom schema is provided, a model's schema name should match the custom schema, rather than being concatenated to the target schema.
If no custom schema is provided, a model's schema name should match the target schema.
In other environments (e.g. dev or qa):
Build all models in the target schema, as in, ignore custom schema configurations.
Alternatively, you can alter the logic of the first code snippet to do something more custom to your specific setup.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
I use the above as my macros/generate_schema_name.sql to stop dbt from prefixing custom schemas with the target schema, regardless of the environment.

dbt cannot create two resources with identical database representations

I have a situation here as below:
There are two models in my dbt project
model-A
{{ config(
materialized='ephemeral',
alias='A_0001',
schema=var('xxx_yyy_dataset')
) }}
model-B
{{ config(
materialized='ephemeral',
alias='B_0002',
schema=var('xxx_yyy_dataset')
) }}
These two are combined by a third model, materialized as incremental in the same schema, as xxx_yyy_dataset.Table_DDD:
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
SELECT * FROM {{ref('A_0001')}}
UNION ALL
SELECT * FROM {{ref('B_0002')}}
This is working fine and it is ingesting records into target table.
Now I have introduced another model, model-C, in a different package:
model-C
{{ config(
materialized='incremental',
alias='Table_DDD',
schema=var('xxx_yyy_dataset')
) }}
This gives me the following error:
$ dbt compile --profiles-dir=profile --target ide
Running with dbt=0.16.0
Encountered an error:
Compilation Error
dbt found two resources with the database representation "xxx_yyy_dataset.Table_DDD".
dbt cannot create two resources with identical database representations. To fix this,
change the "schema" or "alias" configuration of one of these resources:
- model.eplus_rnc_dbt_project.conrol_outcome_joined (models/controls/payment/fa-join/conrol_outcome_joined.sql)
- model.eplus_rnc_dbt_project.dq_control_outcome_joined (models/controls/dq/dq-join/dq_control_outcome_joined.sql)
I have configured custom macros as below:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if custom_alias_name is none -%}
        {{ node.name }}
    {%- else -%}
        {{ custom_alias_name | trim }}
    {%- endif -%}
{%- endmacro %}
dbt is doing its job here!
You have two models that share the exact same configuration — conrol_outcome_joined and dq_control_outcome_joined.
This means that they'll both try to write to the same table: xxx_yyy_dataset.Table_DDD.
dbt is (rightfully) throwing an error here to avoid a problem.
As the error message suggests, you should update one of your models to use a different schema or alias so that it gets represented in your BigQuery project as a separate table.
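If you want to spot such clashes before running dbt, the check is easy to approximate yourself. The sketch below is a simplification and not dbt's actual implementation: it ignores the database name and case sensitivity, and the function name and the input tuples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_relation_collisions(
    models: Iterable[Tuple[str, str, str]],
) -> Dict[str, List[str]]:
    """Group (model_name, schema, alias) tuples and return relations claimed more than once."""
    by_relation = defaultdict(list)
    for name, schema, alias in models:
        by_relation[f"{schema}.{alias}"].append(name)
    # Keep only relations that more than one model would write to.
    return {rel: names for rel, names in by_relation.items() if len(names) > 1}


models = [
    ("conrol_outcome_joined", "xxx_yyy_dataset", "Table_DDD"),
    ("dq_control_outcome_joined", "xxx_yyy_dataset", "Table_DDD"),
    ("model_a", "xxx_yyy_dataset", "A_0001"),
]
print(find_relation_collisions(models))
# {'xxx_yyy_dataset.Table_DDD': ['conrol_outcome_joined', 'dq_control_outcome_joined']}
```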
I struggled with the same problem. I wanted a pipeline of tests that all write to a single incremental table, and it triggered the same error message. I am afraid this is not possible with dbt.
To work around it, I created one model (and therefore one table) per test, plus a main model that selects from and unions all the individual test models. A post_hook on the main model then drops the individual tables, so I end up with a single final testing table that holds all the information.
It is not what I really wanted, because it is not dynamic: every new test has to be added both to the union in the main model and to the drop statement in the post_hook. On the other hand, a test that breaks does not break the other tests, and the database is not cluttered with tables when I start work. You just need to orchestrate it at the right time.
(Another possible approach, which I have not tested: create a single model whose pre_hook creates all the tables you need, since dbt cannot make several models write to the same table. The main part of the model selects from and unions the pre_hook tables, and the post_hook drops them again. This reduces the number of tables written to the database, which is the main drawback of the first approach, even if those tables only exist for a short time.)