dbt macro for a repetitive task

I am a beginner in dbt. I have a requirement where I have created an incremental model like the one below, and I need to execute the same incremental model logic for different systems.
There are 3 variables or parameters that I need to pass: for each run, the ATTRIBUTE_NAME, VIEW_NAME, and SYSTEM_NAME. For the next run, all 3 parameters will be different.
However, for a particular SYSTEM_NAME, the VIEW_NAME and ATTRIBUTE_NAME are fixed.
Please help me execute the dbt run using a macro for this requirement, passing the different system names and their corresponding view names and attribute names. The objective is to use a single dbt run statement and execute this model for every ATTRIBUTE_NAME, VIEW_NAME, and SYSTEM_NAME.
For now, I have defined the variables and executed each run separately for each system in the CLI, e.g.:
dbt run --vars '{"VIEW_NAME": "CCC", "SYSTEM_NAME": "BBBB", "ATTRIBUTE_NAME": "AAAA"}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": "DDD", "SYSTEM_NAME": "FFF", "ATTRIBUTE_NAME": "HHH"}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": "EEE", "SYSTEM_NAME": "GGG", "ATTRIBUTE_NAME": "III"}' -m incremental_modelname
Re-usable incremental model:
{{
  config(
    materialized='incremental',
    transient=false,
    unique_key='composite_key',
    post_hook="insert into table (col1, col2, col3)
      select
        '{{ var('ATTRIBUTE_NAME') }}',
        col2,
        col3
      from {{ this }} a
      join table b on a=b
      where b.SYSTEM_NAME='{{ var('SYSTEM_NAME') }}';
      commit;"
  )
}}

with name1 as (
  select *
  from {{ var('VIEW_NAME') }}
)

select *
from name1
{% if is_incremental() %}
where (select timestamp_column from {{ var('VIEW_NAME') }}) >
      (select max(timestamp_column) from {{ this }} where SYSTEM_NAME = '{{ var("SYSTEM_NAME") }}')
{% endif %}

The easiest way would be to:
Create a model (or even a seed) that holds the system name, view name, and attribute name.
Then, within your model's code, add a for loop:
{% set query %}
select system_name, view_name, attribute_name from {{ ref('model_name') }}
{% endset %}
{% set results = run_query(query) %}
{% for row in results.rows %}
/*
Put your query here, but reference the row values where needed:
row[0] = system_name
row[1] = view_name
row[2] = attribute_name
*/
{% endfor %}
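To make that concrete, here is a minimal sketch under assumed names (the systems seed, its columns, and the all_systems model are all hypothetical): one dbt run renders a single UNION ALL query covering every system defined in the seed.
seeds/systems.csv:
system_name,view_name,attribute_name
BBBB,CCC,AAAA
FFF,DDD,HHH
GGG,EEE,III
models/all_systems.sql:
{% set query %}
select system_name, view_name, attribute_name from {{ ref('systems') }}
{% endset %}
{% set results = run_query(query) %}
{% if execute %}
{% for row in results.rows %}
select
  '{{ row[0] }}' as system_name,
  '{{ row[2] }}' as attribute_name,
  src.*
from {{ row[1] }} as src
{{ 'union all' if not loop.last }}
{% endfor %}
{% endif %}
With the seed checked in, dbt seed followed by dbt run -m all_systems replaces the three separate dbt run --vars invocations.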

Related

How to access BigQuery table metadata in DBT using jinja?

I'd like to access the last modified time column from the metadata of a BigQuery table that acts as a source. I want to create a generic test that checks if the last modified date of the source table is equal to today.
In BigQuery you can access this data in this way:
SELECT
last_modified_time
FROM `project.dataset.__TABLES__`
WHERE table_id = 'table_id'
My goal is to make the project.dataset dynamic depending on the model this test is applied to. Similarly, I'd like for table_id to be dynamic.
Given that dbt's documentation mentions that a BigQuery dataset is equivalent in definition to a 'schema', I tried this, but it didn't work:
{% test last_modified_time(schema, model) %}
SELECT
last_modified_time
FROM `{{ database }}.{{ schema }}.__TABLES__`
WHERE table_id = {{ model }}
{% endtest %}
What this does is render the project name for both database and schema. Also, model will (of course) render the full project.dataset.table_id path, while I only need the table_id.
I'm fairly new to dbt, but I couldn't find anything that resembles what I'm looking for.
Update: I tinkered with the solution below for a little bit, and this works flawlessly.
Thank you so much!
{% test last_modified_time(model) %}
WITH t AS (
SELECT DATE(TIMESTAMP_MILLIS(last_modified_time)) AS lmt
FROM `{{ model.database }}.{{ model.schema }}.__TABLES__`
WHERE table_id = '{{ model.identifier }}'
)
SELECT
lmt
FROM t
WHERE lmt < CURRENT_DATE()
{% endtest %}
There are a few changes you need to make:
Your generic test is accepting an argument named schema, which dbt won't provide when it executes the test. The test should only accept model, and then you'll want to configure your YAML file so the test is applied to the model (not to a column):
models:
  - name: my_model
    tests:
      - last_modified_time
The model argument is a Relation, and you can use that to grab the database/project, schema/dataset, and identifier (the materialized name) of the model.
Tests fail if they return any records, so your test as written will always fail. You need to compare the last_modified_time to the current date and only return records that are older than the current date.
Putting that all together:
{% test last_modified_time(model) %}
with t as (
    select date(timestamp_millis(last_modified_time)) as last_modified_date
    from `{{ model.database }}.{{ model.schema }}.__TABLES__`
    where table_id = '{{ model.identifier }}'
)
select *
from t
where t.last_modified_date < current_date()
{% endtest %}
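Once the test is defined in a macros/ file (or, on newer dbt versions, under tests/generic/) and attached in the YAML above, it runs with the rest of your suite, e.g. (my_model being the hypothetical model name from above):
dbt test -m my_model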

How to iterate through the data of a column in a sql case statement in dbt?

Newbie in dbt here.
I need to do a case statement, something like this:
case when PROPERTY_NAME = 'xxx' and EVENT_CATEGORY = 'xxx' and EVENT_ACTION LIKE '%xxx%' and EVENT_LABEL like '%xxx%'
then 'xxx'
(...)
For PROPERTY_NAME, I need to iterate through a list of values from a column of a table.
Is this doable through a macro?
To get data into the jinja context, you can use the run_query macro to execute arbitrary SQL and return the results as an Agate table. There is an example in the dbt docs for how to use that to return the distinct values from a column.
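That docs pattern looks roughly like this (the stg_payments / payment_method names are the example names from the dbt docs; the execute guard keeps the query from running at parse time):
{% set payment_methods_query %}
select distinct payment_method from {{ ref('stg_payments') }} order by 1
{% endset %}
{% set results = run_query(payment_methods_query) %}
{% if execute %}
{% set payment_methods = results.columns[0].values() %}
{% else %}
{% set payment_methods = [] %}
{% endif %}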
This use case is so common, there is also a macro in dbt-utils for it called get_column_values. This macro returns the distinct values from a column as an array (list). The example from those docs:
-- Returns a list of the payment_methods in the stg_payments model
{% set payment_methods = dbt_utils.get_column_values(table=ref('stg_payments'), column='payment_method') %}
{% for payment_method in payment_methods %}
...
{% endfor %}
...
(Note that you need to install the dbt-utils package first; see the package docs on dbt Hub for instructions. A minimal setup is sketched below.)
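For example (the version pin here is only illustrative; use whatever range matches your dbt version):
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
Then run dbt deps to install it.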
If you are trying to check membership in this list of values, you could do something like this:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
{% set properties_str = properties | join("', '") %}
case
when
PROPERTY_NAME in ('{{ properties_str }}')
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then 'xxx'
...
Or if you want to iterate over that list:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
case
{% for property in properties %}
when
PROPERTY_NAME = '{{ property }}'
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then '{{ property }}'
{% endfor %}
...
I think the best strategy depends on what you want to do. If all you need is the list of all columns, you could use something like get_columns_in_relation and use the results to loop over in your case statement:
{%- set columns = adapter.get_columns_in_relation(ref('model')) -%}
case
when
{% for column in columns %}
{{column.name}} = 'xxx' {{ 'and' if not loop.last }}
{% endfor %}
...
If you don't need every column, you could either exclude some columns from the resulting list or (better, IMO) just define the columns you need in a Jinja variable and loop over those; see the sketch after the next paragraph.
If you need the data from one of the columns, you can use the (similar) run_query macro (or the get_column_values macro in dbt-utils). These have the same pattern of use, i.e. retrieve something into the Jinja layer of dbt and then use that layer to template out some SQL.
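Here is a sketch of that "define the columns in a variable" option (the column list and model name are hypothetical):
{% set survey_columns = ['property_name', 'event_category', 'event_action'] %}
select
case
{% for col in survey_columns %}
when {{ col }} = 'xxx' then '{{ col }}'
{% endfor %}
else 'no_match'
end as first_matching_column
from {{ ref('your_model') }}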

dbt macro: How to join tables on multiple columns in a loop

I'm writing a dbt model to join two tables on multiple columns, which should compile to something like this:
SELECT
A.col1,
A.col2,
A.col3,
FROM
A
LEFT JOIN
B
ON
(A.col1 = B.col1 OR (IS_NAN(A.col1) AND IS_NAN(B.col1))
AND (A.col2 = B.col2 OR (IS_NAN(A.col2) AND IS_NAN(B.col2))
AND (A.col3 = B.col3 OR (IS_NAN(A.col3) AND IS_NAN(B.col3))
and this logic will be applied to many table pairs, so I need a macro. The joining logic is the same on all columns, so a loop over columns in the ON clause would be perfect, like this
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
(A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})),
<-- What to put here for AND the next condition???
{% endfor %}
How can I concatenate the conditions in the ON clause with AND when iterating over the columns?
The cute way (add a predicate that is always true, so you can start every statement with AND):
SELECT
{% for col in all_cols %}
A.{{ col }}{{ "," if not loop.last }}
{% endfor %}
FROM
A
LEFT JOIN
B
ON
1=1
{% for col in all_cols %}
AND (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
The less-cute way uses loop.first (loop is a variable set by Jinja inside a for loop that has some handy properties; loop.first and loop.last are especially useful):
SELECT
{% for col in all_cols %}
A.{{ col }}{{ "," if not loop.last }}
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
{% if not loop.first %}AND{% endif %} (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
Your sample query is missing several ) in the on statement.
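Since this logic will be applied to many table pairs, you could also wrap the condition in a reusable macro. A sketch (the macro name and arguments are my own invention):
{% macro nan_safe_join_on(left_alias, right_alias, columns) %}
1=1
{% for col in columns %}
AND ({{ left_alias }}.{{ col }} = {{ right_alias }}.{{ col }} OR (IS_NAN({{ left_alias }}.{{ col }}) AND IS_NAN({{ right_alias }}.{{ col }})))
{% endfor %}
{% endmacro %}
Used as:
SELECT A.col1, A.col2, A.col3
FROM A
LEFT JOIN B
ON {{ nan_safe_join_on('A', 'B', ['col1', 'col2', 'col3']) }}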
Since you asked for BigQuery, I show here a route to do the task directly in BigQuery, without using the dbt tool.
First, generate a dataset Test and two tables A and B (the statement below creates Test.B; Test.A is created the same way):
CREATE OR REPLACE TABLE Test.B AS
SELECT
  IF(RAND() > 0.5, NULL, RAND()) AS col1,
  IF(RAND() > 0.5, NULL, RAND()) AS col2,
  IF(RAND() > 0.5, NULL, RAND()) AS col3
FROM
  UNNEST(GENERATE_ARRAY(1, 100)) a
Then run this query in the same region (the region has to be set manually if it is not US):
DECLARE col_list ARRAY<STRING> ;
DECLARE col_list_A STRING ;
DECLARE col_list_B STRING ;
DECLARE col_list_on STRING ;
EXECUTE IMMEDIATE
"Select array_agg(column_name) from Test.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='A'" INTO col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('A.',cols)) FROM UNNEST(?) cols" INTO col_list_A USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('B.',cols)) FROM UNNEST(?) cols" INTO col_list_B USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('(A.',cols,' = B.',cols,' OR (IS_NAN(A.',cols,') AND IS_NAN(B.',cols,')) ) '),' AND ') FROM UNNEST(?) cols" INTO col_list_on USING col_list;
EXECUTE IMMEDIATE
"SELECT " || col_list_A || "," || col_list_B || " FROM Test.A LEFT JOIN Test.B ON " || col_list_on
First, all the variables are DECLAREd. Then the column names of table A are queried into the variable col_list. concat is used to build the A.col1, A.col2, ... list, and then the same is done for B. concat is used again for the ON conditions.
Finally, all the variables are assembled into the query.
I would like to warn that this final query will perform poorly on larger tables. If this is an issue for you, please feel free to ask another question with more details about your goal.

dbt if/else macros return nothing

I'm trying to use a dbt macro to transform survey results.
I have a table similar to:
column1    column2
often      sometimes
never      always
...        ...
I want to transform it into:
column1    column2
3          2
1          4
...        ...
using the following mapping:
category     value
always       4
often        3
sometimes    2
never        1
To do so, I have written the following dbt macro:
{% macro class_to_score(class) %}
  {% if class == "always" %}
    {% set result = 4 %}
  {% elif class == "often" %}
    {% set result = 3 %}
  {% elif class == "sometimes" %}
    {% set result = 2 %}
  {% elif class == "never" %}
    {% set result = 1 %}
  {% endif -%}
  {{ return(result) }}
{% endmacro %}
and then the following SQL query:
{%- set class_to_score = class_to_score -%}
select
{{ set_class_to_score(column1) }} as column1_score,
from
table
However, I get the error:
Syntax error: SELECT list must not be empty at [5:1]
Anyone know why I am not getting anything back?
Thanks for the time you took to communicate your question. It's not easy! It looks like you're experiencing the number one misconception when it comes to dbt and Jinja:
Jinja isn't about transforming data, it's about composing a single SQL query that will be sent to the database. After everything inside jinja's curly brackets is rendered, you will be left with a query that can be sent to the database.
This notion does get complicated with dbt macros like run_query (docs), which will go to the database and get information. But the info you fetch can only be used to generate the SQL string.
Your example sounds like the way to do things in Python's pandas, where the transformation happens in memory. In dbt-land, only the string generation happens in memory, though sometimes we fetch some of the substrings from the database before composing the new query. So while you'd like Jinja to look at every value in the column and make the substitution, what you really need to be doing is generating a query that instructs the database to make the substitution. The way we do substitution in SQL is with CASE WHEN statements (see Mode's CASE docs for more info).
This is probably closer to what you want. Note that it's probably better to turn the likert_map object into a dbt seed table (a sketch of that follows the query below).
{% set likert_map =
  {"4": "always", "3": "often", "2": "sometimes", "1": "never"} %}
SELECT
  CASE column_1
    {% for key, value in likert_map.items() %}
    WHEN '{{ value }}' THEN {{ key }}
    {% endfor %}
    ELSE 0
  END AS column_1_new,
  CASE column_2
    {% for key, value in likert_map.items() %}
    WHEN '{{ value }}' THEN {{ key }}
    {% endfor %}
    ELSE 0
  END AS column_2_new
FROM
  table
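A sketch of the seed alternative (the file and model names here are assumed): put the mapping in a CSV, run dbt seed, and join to it instead of hard-coding the dictionary.
seeds/likert_map.csv:
category,value
always,4
often,3
sometimes,2
never,1
models/survey_scores.sql:
select
  m1.value as column1_score,
  m2.value as column2_score
from {{ ref('survey_results') }} s
left join {{ ref('likert_map') }} m1 on s.column1 = m1.category
left join {{ ref('likert_map') }} m2 on s.column2 = m2.category
This keeps the mapping in version control as data, so it can be edited without touching any Jinja.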
Here are some related questions about using mapping/dictionary information to make a SQL query:
How to join two tables into a dictionary in dbt jinja
DBT - for loop issue with number as variable

Assign the value of a column to a variable in SQL using the Jinja template language

I have a SQL file like this, to transform a table that has a column containing a JSON string:
{{ config(materialized='table') }}
with customer_orders as (
select
time,
data as jsonData,
{% set my_dict = fromjson( jsonData ) %}
{% do log("Printout: " ~ my_dict, info=true) %}
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
When I run dbt run, it returns this:
Running with dbt=0.21.0
Encountered an error:
the JSON object must be str, bytes or bytearray, not Undefined
I can't even print out the value of the column I want:
{{ config(materialized='table') }}
with customer_orders as (
select
time,
tag,
data as jsonData,
{% do log("Printout: " ~ data, info=true) %}
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
22:42:58 | Concurrency: 1 threads (target='dev')
22:42:58 |
Printout:
22:42:58 | Done.
But if I create another model to print out the values of jsonData:
{%- set payment_methods = dbt_utils.get_column_values(
table=ref('customer_orders_model'),
column='jsonData'
) -%}
{% do log(payment_methods, info=true) %}
{% for json in payment_methods %}
{% set my_dict = fromjson(json) %}
{% do log(my_dict, info=true) %}
{% endfor %}
it prints out the JSON values I want:
Running with dbt=0.21.0
This is log
Found 2 models, 0 tests, 0 snapshots, 0 analyses, 372 macros, 0 operations, 0 seed files, 0 sources, 0 exposures
21:41:15 | Concurrency: 1 threads (target='dev')
21:41:15 |
['{"log": "ok", "path": "/var/log/containers/...log", "time": "2021-10-26T08:50:52.412932061Z", "offset": 527, "stream": "stdout", "#timestamp": 1635238252.412932}']
{'log': 'ok', 'path': '/var/log/containers/...log', 'time': '2021-10-26T08:50:52.412932061Z', 'offset': 527, 'stream': 'stdout', '#timestamp': 1635238252.412932}
21:41:21 | Done.
But I want to process this jsonData within a model file like customer_orders_model above.
How can I get the value of a column, assign it to a variable, and continue to process it however I want (e.g., check whether the JSON has a key I care about and set its value as a new column)?
Note: my purpose is that my table has a JSON string column, and I want to extract that column into many columns so I can easily write the SQL queries I want.
In the case of a BigQuery database, Google provides JSON functions in Standard SQL.
If your column is a JSON string, I think you can use JSON_EXTRACT to get the value of the key you want.
For example:
with customer_orders as (
select
time,
tag,
data as jsonData,
json_extract(data, '$.log') AS log
from `dc-warehouses.raw_data.logs_trackfoe_prod`
limit 5
)
select *
from customer_orders
You are very close! The thing to remember is that dbt and Jinja are primarily for rendering text. Anything that isn't in curly brackets is just a text string.
So in your first example, data and jsonData are substrings of the larger query (which is also a string). They aren't variables that Jinja knows about, which explains the error message saying they are Undefined:
with customer_orders as (
select
time,
data as jsonData,
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
This is why dbt_utils.get_column_values() works for you: that macro actually runs a query to get the data, and you're assigning the result to a variable. The run_query macro can be helpful for situations like this (and I'm fairly certain get_column_values uses run_query in the background).
Regarding your original question, you want to turn a JSON dict into multiple columns, and I'd first recommend having your database do this directly; many databases have functions for it. Jinja is primarily for generating SQL queries dynamically, not for manipulating data. Even if you could load all the JSON into Jinja, I don't know how you'd write it back into a table without something like an INSERT INTO ... VALUES statement, which, IMHO, goes against the design principles of dbt.
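Putting the two answers together, the extraction can live directly in the model. A sketch for BigQuery, with the key names ($.log, $.path, $.offset) taken from the JSON printed in the logs above:
{{ config(materialized='table') }}
select
  time,
  tag,
  json_extract_scalar(data, '$.log') as log,
  json_extract_scalar(data, '$.path') as path,
  cast(json_extract_scalar(data, '$.offset') as int64) as log_offset
from `warehouses.raw_data.customer_orders`
Each key becomes a real column in the materialized table, so downstream queries never have to touch the raw JSON.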
In regards to your original question, you want to turn a JSON dict into multiple columns, I'd first recommend having your db do this directly. Many dbs have functions that let you do this. Primarily jinja is for generating SQL queries dynamically, not for manipulating data. Even if you could load all the JSON into jinja, I don't know how you'd write that back into a table without using something like a INSERT INTO VALUES statement which, IMHO, goes against the design principle of dbt.