How can I generate different SQL for dbt models during train vs. test runs?

This question came up on dbt Slack (I've paraphrased here):
I want my dbt models to use different logic depending on whether I'm training or testing a data set. Here is my current code, which uses SQL variables:
SELECT
*
FROM tracks tr
WHERE
event_timestamp >= (CURRENT_DATE - INTERVAL %(from_inclusive)s) -
INTERVAL '60 days'
from_inclusive is a variable taken from this json:
{
  "train": {
    "from_inclusive": "15 days"
  },
  "train-extra": {
    "from_inclusive": "1 days"
  },
  "eval-test": {
    "from_inclusive": "1 days"
  }
}
How can I write a dbt model so that from_inclusive compiles different SQL depending on a variable setting?

I think you can write a macro to do this.
Off the top of my head, the macro could accept one argument (e.g., "train" or "test"), contain an if/else statement, and return the appropriate number of days depending on the argument. That macro could be used throughout your project and would result in appropriately compiled SQL.
Assuming the model has some way of knowing whether it's a train or test (or other) run, you could feed the macro different arguments based on the schema and/or environment, like this:
select
*
from tracks tr
where event_timestamp >= (current_date - interval
{% if target.name == 'train' %}
{{ your_macro('train') }}
{%- else -%}
{{ your_macro('test') }}
{% endif %}
) - interval '60 days'
I used target as an example, but you could do the same thing with other states or configs to determine the argument that goes into the macro.
Your macro (your_macro) would then be stored in the macros subdirectory and would look like this:
{% macro your_macro(state) %}
{# Return an interval literal based on the run state #}
{% if state == 'train' %}
'15 days'
{% elif state == 'test' %}
'1 day'
{% else %}
'0 days'
{% endif %}
{% endmacro %}
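If you'd rather not create separate targets for this, the same branching can hang off a dbt variable instead (the name run_mode below is just illustrative), passed on the command line as dbt run --vars '{"run_mode": "train"}' and read with var() and a default:
select
*
from tracks tr
where event_timestamp >= (
    current_date - interval {{ your_macro(var('run_mode', 'test')) }}
) - interval '60 days'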

Related

dbt get value from agate.Row to string

I want to run a macro in a COPY INTO statement to an S3 bucket. Apparently in Snowflake I can't use a dynamic path, so I'm solving this in a hacky way.
{% macro unload_snowflake_to_s3() %}
{# Get all tables and views from the information schema. #}
{%- set query -%}
select concat('COPY INTO #MY_STAGE/year=', year(current_date()), '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)');
{%- endset -%}
-- {%- set final_query = run_query(query) -%}
-- {{ dbt_utils.log_info(final_query) }}
-- {{ dbt_utils.log_info(final_query.rows.values()[0]) }}
{%- do run_query(final_query.columns.values()[0]) -%}
-- {% do final_query.print_table() %}
{% endmacro %}
Based on the macro above, what I'm trying to do is:
Use CONCAT to add the year to the bucket path; hence, the query becomes a string.
Use the concatenated query to call run_query() again to actually run the COPY INTO statement.
Output and error I got from the dbt log:
09:06:08 09:06:08 + | column | data_type |
| ----------------------------------------------------------------------------------------------------------- | --------- |
| COPY INTO #MY_STAGE/year=', year(current_date()), '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table) | Text |
09:06:08 09:06:08 + <agate.Row: ('COPY INTO #MY_STAGE/year=2022/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)')>
09:06:09 Encountered an error while running operation: Database Error
001003 (42000): SQL compilation error:
syntax error line 1 at position 0 unexpected '<'.
I think the error is that I didn't specifically extract the row and column, which are in agate format. How can I convert/extract this to a string?
You might have better luck with dbt_utils.get_query_results_as_dict.
But you don't need to use your database to construct that path. The jinja context has a run_started_at variable that is a Python datetime object, so you can build your string in jinja, without hitting the database:
{% set yr = run_started_at.strftime("%Y") %}
{% set query = 'COPY INTO #MY_STAGE/year=' ~ yr ~ '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)' %}
Finally, depending on how you're calling this macro you probably want to gate this whole thing with an {% if execute %} flag, so dbt doesn't do the COPY when it's parsing your models.
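Putting those suggestions together, a rough sketch of the macro (untested, keeping the stage path and table name from the question) could look like:
{% macro unload_snowflake_to_s3() %}
    {# Only hit the database at execution time, not while dbt parses the project #}
    {% if execute %}
        {# Build the year from the run timestamp instead of querying the database #}
        {% set yr = run_started_at.strftime("%Y") %}
        {% set copy_sql = "COPY INTO #MY_STAGE/year=" ~ yr ~ "/my_file FROM (SELECT OBJECT_CONSTRUCT(*) FROM my_table)" %}
        {% do log(copy_sql, info=True) %}
        {% do run_query(copy_sql) %}
    {% endif %}
{% endmacro %}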
You can use the dbt_utils.get_query_results_as_dict function to get rid of the agate part. Maybe after that your COPY statement will work.
{%- set final_query = dbt_utils.get_query_results_as_dict(query) -%}
{{ log(final_query, true) }}
{% for keys, val in final_query.items() %}
{{ log(keys, true) }}
{{ log(val, true) }}
{% endfor %}
If you run it like this, you will see ('COPY INTO #MY_STAGE/year=', year(current_date())...'), and you can strip the wrapping parentheses and quotes with:
{%- set final_val = val | replace('(', '') | replace(')', '') | replace("'", '') -%}
That's it.

Can you add where clauses to a model based on an environment variable in dbt?

I would like to be able to add a conditional to a model based on the environment variable.
Something like:
SELECT ....
FROM ....
WHERE env_var = {{ env_var('DBT_VAR') }}
That way I can run this model for all my target schemas but have a where clause that lets me do something different for a specific environment variable. So if I had 4 different environment variables that all need the same model, but 2 of them needed an extra where clause, I wouldn't have to rewrite the model 4 times; I could just use one and it would run depending on the environment variable.
Yes, this should work exactly as you've written it, assuming you have a field in your model also called env_var (if the variable's value is a string, remember to quote it: WHERE env_var = '{{ env_var('DBT_VAR') }}').
You could also branch using an if statement, based on an env var or target.name:
SELECT ....
FROM ....
{% if target.name != 'prod' %}
WHERE date_field > current_timestamp - interval '1 month'
{% endif %}
UPDATE:
You can also compare the target to a variable, if you'd rather not hard-code that. You can use either var or env_var -- the approach is very similar:
SELECT ....
FROM ....
{% if target.name != env_var('PROD_TARGET') %}
WHERE date_field > current_timestamp - interval '1 month'
{% endif %}
Or if you just want to check whether an env var is set (passing a default so dbt doesn't raise an error when the variable is missing):
SELECT ....
FROM ....
{% if env_var('THIS_IS_PROD', '') != '' %}
WHERE date_field > current_timestamp - interval '1 month'
{% endif %}

How to iterate through the data of a column in a sql case statement in dbt?

Newbie in dbt here.
I need to do a case statement, something like this:
case when PROPERTY_NAME = 'xxx' and EVENT_CATEGORY = 'xxx' and EVENT_ACTION LIKE '%xxx%' and EVENT_LABEL like '%xxx%'
then 'xxx'
(...)
For the property name, I need to iterate through a list of values from a column of a table.
Is it doable through a macro?
To get data into the jinja context, you can use the run_query macro to execute arbitrary SQL and return the results as an Agate table. There is an example in the dbt docs for how to use that to return the distinct values from a column.
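For reference, that pattern looks roughly like this (the model and column names are placeholders):
{% set property_query %}
select distinct property_name from {{ ref('your_model') }}
{% endset %}
{% set results = run_query(property_query) %}
{% if execute %}
{# results is an agate table; its first column holds the distinct values #}
{% set property_list = results.columns[0].values() %}
{% else %}
{% set property_list = [] %}
{% endif %}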
This use case is so common, there is also a macro in dbt-utils for it called get_column_values. This macro returns the distinct values from a column as an array (list). The example from those docs:
-- Returns a list of the payment_methods in the stg_payments model
{% set payment_methods = dbt_utils.get_column_values(table=ref('stg_payments'), column='payment_method') %}
{% for payment_method in payment_methods %}
...
{% endfor %}
...
(Note that you need to first install the dbt-utils package; see the dbt docs on installing packages.)
If you are trying to check membership in this list of values, you could do something like this:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
{% set properties_str = properties | join("', '") %}
case
when
PROPERTY_NAME in ('{{ properties_str }}')
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then 'xxx'
...
Or if you want to iterate over that list:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
case
{% for property in properties %}
when
PROPERTY_NAME = '{{ property }}'
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then '{{ property }}'
{% endfor %}
...
I think the best strategy depends on what you want to do. If all you need is the list of all columns, you could use something like get_columns_in_relation and use the results to loop over in your case statement:
{%- set columns = adapter.get_columns_in_relation(ref('model')) -%}
case
when
{% for column in columns %}
{{column.name}} = 'xxx' {{ 'and' if not loop.last }}
{% endfor %}
...
If you don't need every column, you could either exclude some columns from the resulting list or (better IMO) just define the columns you need in a jinja variable and loop over those.
If you need the data from one of the columns, you can use the (similar) run_query macro (or the get_column_values macro in dbt-utils). These have the same pattern of use, i.e., retrieve something into the jinja layer of dbt and then use that layer to template out some SQL.
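For example, a minimal sketch of the "define the columns you need in a jinja variable" option (the column list here is hypothetical):
{% set check_columns = ['property_name', 'event_category', 'event_action'] %}
case
    when
    {% for column in check_columns %}
    {{ column }} = 'xxx' {{ 'and' if not loop.last }}
    {% endfor %}
    then 'xxx'
end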

Expressing complex logic in case statements in dbt using a lookup table

I'm reworking a good part of our Analytics Data Warehouse and am trying to build things out in a much more modular way. We've swapped to dbt from an in-house transformation tool, and I'm trying to take advantage of the functionality it offers.
Previously, the way we classified our rental segments was in a series of CASE statements which evaluate a few fields. These are (pseudocode):
CASE WHEN rental_rule_type <> monthly
AND rental_length BETWEEN 6 AND 24
AND rental_day IN (0,1,2,3,4)
AND rental_starts IN (5,6,7,8,9,10,11)
THEN weekday_daytime_rental
This obviously works. But it's ugly and hard to update. If we want to adjust this, we'll need to do so in the SQL rather than in a lookup table.
What I'd like to build is a simple lookup table that holds these values and can be adjusted at a later date to easily change how we classify these rentals, but I'm not sure what the best approach is.
My current thought is to layer these conditions into an Excel file, load it into the warehouse with dbt, and then join on these conditions; however, I'm not sure whether that would end up being cleaner logic or not. It would mean there are no hardcoded values in the code, but it would likely still result in a ton of ugly cases and joins.
I think there are also some global variables I could define in dbt which may help with this?
Anyone approach something similar? Would love to hear some best practices.
Love the question here, and actually tconbeer's answer as well.
However, if you want an answer that "favors readability over DRY-ness", there is another appropriate middle ground here, which is generally regarded as a dbt best practice: model-defined sets.
Example:
{% set rental_rule_type = "monthly" %}
{% set rental_length_low = 6 %}
{% set rental_length_high = 24 %}
{% set rental_days = [0, 1, 2, 3, 4] %}
{% set rental_starts = [5, 6, 7, 8, 9, 10, 11] %}
with some_cte as (
select * from {{ ref('some_source') }}
)
select *,
CASE
WHEN rental_rule_type <> '{{ rental_rule_type }}'
AND rental_length BETWEEN {{ rental_length_low }} AND {{ rental_length_high }}
AND rental_day IN (
{% for rental_day in rental_days %}
{{ rental_day }} {%- if not loop.last -%}, {%- endif -%}
{% endfor %}
)
AND rental_starts IN (
{% for rental_start in rental_starts %}
{{ rental_start }} {%- if not loop.last -%}, {%- endif -%}
{% endfor %}
)
THEN 'weekday_daytime_rental'
END
from some_cte
Example 2 (equivalent but cleaner, as suggested in a comment):
{% set rental_rule_type = "monthly" %}
{% set rental_length_low = 6 %}
{% set rental_length_high = 24 %}
{% set rental_days = [0, 1, 2, 3, 4] %}
{% set rental_starts = [5, 6, 7, 8, 9, 10, 11] %}
with some_cte as (
select * from {{ ref('some_source') }}
)
select *,
CASE
WHEN rental_rule_type <> '{{ rental_rule_type }}'
AND rental_length BETWEEN {{ rental_length_low }} AND {{ rental_length_high }}
AND rental_day IN ( {{ rental_days | join(", ") }} )
AND rental_starts IN ( {{ rental_starts | join(", ") }} )
THEN 'weekday_daytime_rental'
END
from some_cte
In this format, all the logic stays visible to someone reading the model, and changing it is much easier since all the variables are gathered in a single location.
It's also much easier to see at a quick glance that variables/macros are in play than if your case statement is buried deep in a chain of CTEs or in a more complex select statement after some CTEs.
Warning: I haven't compiled this, so I'm not 100% sure it will work as-is, but it should get you started if you go in this direction.
I've tried what you describe, and I've generally regretted it.
Logic really should be expressed as code, not data. It should be source-controlled, reviewable, and support multiple environments (dev and prod).
Seed files (with dbt seed) are kind-of data, and kind-of code, since they get checked into source control alongside the code. This at least solves the multiple environments problem, but it makes code review extremely difficult.
I'd recommend doing what software engineers do -- encapsulate the logic into easily-understandable and easily-testable components, and then compose those components in your model. Macros work pretty well for this.
For example, your case statement above could become a macro called is_weekday_daytime_rental():
{% macro is_weekday_daytime_rental(rental_rule_type, rental_length, rental_day, rental_starts) %}
CASE
WHEN {{ rental_rule_type }} <> 'monthly'
AND {{ rental_length }} BETWEEN 6 AND 24
AND {{ rental_day }} IN (0,1,2,3,4)
AND {{ rental_starts }} IN (5,6,7,8,9,10,11)
THEN true
ELSE false
END
{% endmacro %}
then you could call that macro in your model (passing the column names as strings), like:
CASE
WHEN
{{ is_weekday_daytime_rental(
'rental_rule_type',
'rental_length',
'rental_day',
'rental_starts'
) }}
THEN 'weekday_daytime_rental'
WHEN ...
But let's do better. Assuming you're also going to have is_weekend_daytime_rental, then each of those component bits of logic should be its own macro that you can reuse:
{% macro is_weekday_daytime_rental(rental_rule_type, rental_length, rental_day, rental_starts) %}
CASE
WHEN {{ is_daily_rental(rental_rule_type, rental_length) }}
AND {{ is_weekday(rental_day) }}
AND {{ is_daytime(rental_starts) }}
THEN true
ELSE false
END
{% endmacro %}
where each component looks like:
{% macro is_weekday(day_number) %}
CASE
WHEN {{ day_number }} IN (0, 1, 2, 3, 4)
THEN true
ELSE false
END
{% endmacro %}
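For what it's worth, those component macros would be called from a model the same way as the composite one, passing the column name as a string, e.g. (reusing the some_source ref from the earlier answer):
select
    *,
    {{ is_weekday('rental_day') }} as is_weekday_rental
from {{ ref('some_source') }}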

dbt if/else macros return nothing

I'm trying to use a dbt macro to transform survey results.
I have a table similar to:
column1      column2
often        sometimes
never        always
...          ...
I want to transform it into:
column1      column2
3            2
1            4
...          ...
using the following mapping:
category     value
always       4
often        3
sometimes    2
never        1
To do so I have written the following dbt macro:
{% macro class_to_score(class) %}
{% if class == "always" %}
{% set result = 1 %}
{% elif class == "often" %}
{% set result = 2 %}
{% elif class == "sometimes" %}
{% set result = 3 %}
{% elif class == "never" %}
{% set result = 4 %}
{% endif -%}
{{ return(result) }}
{% endmacro %}
and then the following SQL query:
{%- set class_to_score = class_to_score -%}
select
{{ set_class_to_score(column1) }} as column1_score,
from
table
However, I get the error:
Syntax error: SELECT list must not be empty at [5:1]
Anyone know why I am not getting anything back?
Thanks for the time you took to communicate your question. It's not easy! It looks like you're experiencing the number one misconception when it comes to dbt and Jinja:
Jinja isn't about transforming data, it's about composing a single SQL query that will be sent to the database. After everything inside jinja's curly brackets is rendered, you will be left with a query that can be sent to the database.
This notion does get complicated with dbt macros like run_query, which go to the database and fetch information. But the info you fetch can only be used to generate the SQL string.
Your example sounds like the way you'd do things in Python's pandas, where the transformation happens in memory. In dbt-land, only the string generation happens in memory, though sometimes we fetch some of the substrings from the database before making the new query. So while it sounds like you'd like Jinja to look at every value in the column and make the substitution, what you really need to do is generate a query that instructs the database to make the substitution. The way we do substitution in SQL is with CASE WHEN statements (see Mode's CASE docs for more info).
This is probably closer to what you want. Note it's probably better to make the likert_map object into a dbt seed table.
{% set likert_map =
    {"1": "always", "2": "often", "3": "sometimes", "4": "never"} %}
SELECT
    CASE column_1
    {% for key, value in likert_map.items() %}
    WHEN '{{ value }}' THEN {{ key }}
    {% endfor %}
    ELSE 0 END AS column_1_new,
    CASE column_2
    {% for key, value in likert_map.items() %}
    WHEN '{{ value }}' THEN {{ key }}
    {% endfor %}
    ELSE 0 END AS column_2_new
FROM
    table
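As noted above, the mapping could also live in a dbt seed instead of a Jinja dict. A minimal sketch, assuming a seed file named likert_map.csv (the name and columns are illustrative) holding the category/value pairs from the question:
-- seeds/likert_map.csv would contain:
-- category,score
-- always,4
-- often,3
-- sometimes,2
-- never,1

select
    t.*,
    m1.score as column_1_new  -- repeat the join with another alias for column_2
from table t
left join {{ ref('likert_map') }} m1
    on t.column_1 = m1.category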
Here are some related questions about using mapping-dictionary information to build a SQL query:
How to join two tables into a dictionary in dbt jinja
DBT - for loop issue with number as variable