dbt macro: How to join tables on multiple columns in a loop

I'm writing a dbt model to join two tables on multiple columns, compiled to something like this:
SELECT
A.col1,
A.col2,
A.col3,
FROM
A
LEFT JOIN
B
ON
(A.col1 = B.col1 OR (IS_NAN(A.col1) AND IS_NAN(B.col1))
AND (A.col2 = B.col2 OR (IS_NAN(A.col2) AND IS_NAN(B.col2))
AND (A.col3 = B.col3 OR (IS_NAN(A.col3) AND IS_NAN(B.col3))
This logic will be applied to many table pairs, so I need a macro. The joining logic is the same for all columns, so a loop over the columns in the ON clause would be perfect, like this:
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
(A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})),
<-- What to put here for AND the next condition???
{% endfor %}
How can I concatenate the conditions in the ON clause with AND when iterating over the columns?

The cute way (add a predicate that is always true, so you can start every condition with AND):
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
1=1
{% for col in all_cols %}
AND (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
The less-cute way, using loop.first (loop is a variable set by Jinja inside a for loop that has some handy properties; loop.first and loop.last are especially useful):
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
{% if not loop.first %}AND{% endif %} (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
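Either variant drops straight into the macro you're after. A minimal sketch (the macro name and arguments are mine, not from the question; it takes the two table names and the shared column list):
{% macro nan_safe_join(left, right, cols) %}
{# hypothetical macro: left/right are table names, cols is the column list #}
SELECT
{% for col in cols %}
    {{ left }}.{{ col }}{{ "," if not loop.last }}
{% endfor %}
FROM {{ left }}
LEFT JOIN {{ right }}
ON 1=1
{% for col in cols %}
    AND ({{ left }}.{{ col }} = {{ right }}.{{ col }} OR (IS_NAN({{ left }}.{{ col }}) AND IS_NAN({{ right }}.{{ col }})))
{% endfor %}
{% endmacro %}
A model then reduces to {{ nan_safe_join('A', 'B', ['col1', 'col2', 'col3']) }}.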

Your sample query is missing several ) in the ON clause.
Since you tagged BigQuery, here is a route to do the task directly in BigQuery, without using dbt.
First generate a dataset Test and two tables A and B (the statement below creates B; running it again with Test.A creates the other table):
CREATE OR REPLACE TABLE Test.B AS
SELECT
  IF(RAND()>0.5, NULL, RAND()) col1,
  IF(RAND()>0.5, NULL, RAND()) col2,
  IF(RAND()>0.5, NULL, RAND()) col3
FROM UNNEST(GENERATE_ARRAY(1,100)) a
Then run this script in the same region (the query location has to be set manually if it is not US):
DECLARE col_list ARRAY<STRING> ;
DECLARE col_list_A STRING ;
DECLARE col_list_B STRING ;
DECLARE col_list_on STRING ;
EXECUTE IMMEDIATE
"Select array_agg(column_name) from Test.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='A'" INTO col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('A.',cols)) FROM UNNEST(?) cols" INTO col_list_A USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('B.',cols)) FROM UNNEST(?) cols" INTO col_list_B USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('(A.',cols,' = B.',cols,' OR (IS_NAN(A.',cols,') AND IS_NAN(B.',cols,')) ) '),' AND ') FROM UNNEST(?) cols" INTO col_list_on USING col_list;
EXECUTE IMMEDIATE
"SELECT " || col_list_A || "," || col_list_B || " FROM Test.A LEFT JOIN Test.B ON " || col_list_on
First, DECLARE all variables. Then query the column names of table A into the variable col_list. CONCAT builds the A.col1, A.col2, ... list, and then the same is done for B. CONCAT is used again to build the ON conditions.
Finally, all the variables are put together into the query.
A word of warning: this final query will perform poorly on larger tables. If that is an issue for you, please feel free to ask another question with more details about your goal.

Related

dbt get value from agate.Row to string

I want to run a macro to execute a COPY INTO statement to an S3 bucket. Apparently in Snowflake I can't use a dynamic path, so I'm solving this in a hacky way.
{% macro unload_snowflake_to_s3() %}
{# Get all tables and views from the information schema. #}
{%- set query -%}
select concat('COPY INTO #MY_STAGE/year=', year(current_date()), '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)');
{%- endset -%}
-- {%- set final_query = run_query(query) -%}
-- {{ dbt_utils.log_info(final_query) }}
-- {{ dbt_utils.log_info(final_query.rows.values()[0]) }}
{%- do run_query(final_query.columns.values()[0]) -%}
-- {% do final_query.print_table() %}
{% endmacro %}
Based on the macro above, what I'm trying to do is:
Use CONCAT to add the year to the bucket path. Hence, the query becomes a string.
Use run_query() again on the concatenated query to actually run the COPY INTO statement.
Output and error I got from dbt log:
09:06:08 09:06:08 + | column | data_type |
| ----------------------------------------------------------------------------------------------------------- | --------- |
| COPY INTO #MY_STAGE/year=', year(current_date()), '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table) | Text |
09:06:08 09:06:08 + <agate.Row: ('COPY INTO #MY_STAGE/year=2022/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)')>
09:06:09 Encountered an error while running operation: Database Error
001003 (42000): SQL compilation error:
syntax error line 1 at position 0 unexpected '<'.
I think the error is that I didn't extract the specific row and column, which are in agate format. How can I convert/extract this to a string?
You might have better luck with dbt_utils.get_query_results_as_dict.
But you don't need to use your database to construct that path. The jinja context has a run_started_at variable that is a Python datetime object, so you can build your string in jinja, without hitting the database:
{% set yr = run_started_at.strftime("%Y") %}
{% set query = 'COPY INTO #MY_STAGE/year=' ~ yr ~ '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)' %}
Finally, depending on how you're calling this macro you probably want to gate this whole thing with an {% if execute %} flag, so dbt doesn't do the COPY when it's parsing your models.
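Putting those pieces together, a minimal sketch of the whole macro (keeping your #MY_STAGE and my_table placeholders as-is; whether the COPY itself succeeds depends on your stage setup):
{% macro unload_snowflake_to_s3() %}
{% if execute %}
  {# build the path in jinja, no database round-trip needed #}
  {% set yr = run_started_at.strftime("%Y") %}
  {% set query = 'COPY INTO #MY_STAGE/year=' ~ yr ~ '/my_file FROM (SELECT OBJECT_CONSTRUCT(*) from my_table)' %}
  {% do run_query(query) %}
{% endif %}
{% endmacro %}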
You can use the dbt_utils.get_query_results_as_dict function to get rid of the agate part. After that, your COPY statement may work.
{%- set final_query = dbt_utils.get_query_results_as_dict(query) -%}
{{log(final_query ,true)}}
{% for keys,val in final_query.items() %}
{{log(keys,true)}}
{{log( val ,true)}}
{% endfor %}
If you run it like this, you will see ('COPY INTO #MY_STAGE/year=', year(current_date())...'), and lastly you can remove the "('')" wrapper with:
{%- set final_val = val | replace('(', '') | replace(')', '') | replace("'", '') -%}
That's it.

How to iterate through the data of a column in a sql case statement in dbt?

Newbie in dbt here.
I need to do a case statement, something like this:
case when PROPERTY_NAME = 'xxx' and EVENT_CATEGORY = 'xxx' and EVENT_ACTION LIKE '%xxx%' and EVENT_LABEL like '%xxx%'
then 'xxx'
(...)
For the property name, I need to iterate through a list of values from a column of a table.
Is it doable through a macro?
To get data into the jinja context, you can use the run_query macro to execute arbitrary SQL and return the results as an Agate table. There is an example in the dbt docs for how to use that to return the distinct values from a column.
This use case is so common, there is also a macro in dbt-utils for it called get_column_values. This macro returns the distinct values from a column as an array (list). The example from those docs:
-- Returns a list of the payment_methods in the stg_payments model
{% set payment_methods = dbt_utils.get_column_values(table=ref('stg_payments'), column='payment_method') %}
{% for payment_method in payment_methods %}
...
{% endfor %}
...
(Note that you need to install the dbt-utils package first; see the dbt-utils docs for instructions.)
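For reference, installation is a packages.yml entry followed by dbt deps (the version pin here is illustrative; check the package hub for the current one):
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]  # illustrative pin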
If you are trying to check membership in this list of values, you could do something like this:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
{% set properties_str = properties | join("', '") %}
case
when
PROPERTY_NAME in ('{{ properties_str }}')
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then 'xxx'
...
Or if you want to iterate over that list:
{% set properties = dbt_utils.get_column_values(table=ref('your_model'), column='property') %}
case
{% for property in properties %}
when
PROPERTY_NAME = '{{ property }}'
and EVENT_CATEGORY = 'xxx'
and EVENT_ACTION LIKE '%xxx%'
and EVENT_LABEL like '%xxx%'
then '{{ property }}'
{% endfor %}
...
I think the best strategy depends on what you want to do. If all you need is the list of all columns, you could use something like get_columns_in_relation and use the results to loop over in your case statement:
{%- set columns = adapter.get_columns_in_relation(ref('model')) -%}
case
when
{% for column in columns %}
{{column.name}} = 'xxx' {{ 'and' if not loop.last }}
{% endfor %}
...
If you don't need every column, you could either exclude some columns from the resulting list or (better IMO) just define the columns you need in a jinja variable and loop over those.
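For that last option, a minimal sketch (the variable and column names here are hypothetical):
{% set my_columns = ['property_name', 'event_category', 'event_action'] %}
{# loop the hand-picked columns into the predicate #}
case
  when
  {% for column in my_columns %}
  {{ column }} = 'xxx' {{ 'and' if not loop.last }}
  {% endfor %}
  then 'xxx'
end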
If you need the data from one of the columns, you can use the (similar) run_query macro (or the get_column_values macro in dbt-utils). These have the same pattern of use, i.e., retrieve something into the jinja layer of dbt, and then use that layer to template out some SQL.

DBT macro for repetitive task

I am a beginner in dbt. I have a requirement where I have created an incremental model like below, and I need to execute the same incremental model logic for different systems.
There are 3 variables or parameters that I need to pass, i.e. for each run the ATTRIBUTE_NAME, VIEW_NAME, and SYSTEM_NAME need to be passed. For the next run, all 3 parameters would be different.
However, for a particular SYSTEM_NAME, the VIEW_NAME and ATTRIBUTE_NAME are fixed.
Please help me execute the dbt run using a macro for this requirement, passing the different system names and their corresponding view names and attribute names. The objective is to use a single dbt run statement and execute this model for all ATTRIBUTE_NAME, VIEW_NAME, SYSTEM_NAME combinations.
For now, I have defined the variables and execute each run separately for each system in the CLI, like below.
e.g.
dbt run --vars '{"VIEW_NAME": CCC, "SYSTEM_NAME": BBBB, "ATTRIBUTE_NAME": AAAA}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": DDD, "SYSTEM_NAME": FFF, "ATTRIBUTE_NAME": HHH}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": EEE, "SYSTEM_NAME": GGG, "ATTRIBUTE_NAME": III}' -m incremental_modelname
Re-usable incremental model:
{{
config(
materialized='incremental',
transient=false,
unique_key='composite_key',
post_hook="insert into table (col1, col2, col3)
select
'{{ var('ATTRIBUTE_NAME') }}',
col2,
col3
from {{ this }} a
join table b on a=b
where b.SYSTEM_NAME='{{ var('SYSTEM_NAME') }}';
commit;"
)
}}
with name1 AS (
select
*
from {{ var('VIEW_NAME') }}
)
select
*
from name1
{% if is_incremental() %}
where (select timestamp_column from {{ var('VIEW_NAME') }}) >
(select max(timestamp_column) from {{ this }} where SYSTEM_NAME='{{ var("SYSTEM_NAME") }}')
{% endif %}
The easiest way would be to:
Create a model (or even a seed) that holds the system name, view name, and attribute name.
Within your code, add a for loop:
{% set query %}
select system_name, view_name, attribute_name from model_name
{% endset %}
{% set results = run_query(query) %}
{% for result in results %}
/*
Put your query here, but reference the values needed from each row:
result[0] = system_name
result[1] = view_name
result[2] = attribute_name
*/
{% endfor %}
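For example, if the goal is a single model covering every system, each row can be templated into one branch of a UNION ALL. A minimal sketch, assuming a model named system_config holds the three columns (the model name and the simplified select are mine, not from the question):
{% set query %}
select system_name, view_name, attribute_name from {{ ref('system_config') }}
{% endset %}
{% if execute %}
{% set results = run_query(query) %}
{% for result in results %}
{# one branch per configured system #}
select '{{ result[0] }}' as system_name, *
from {{ result[1] }}
{{ 'union all' if not loop.last }}
{% endfor %}
{% endif %}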

Jinja and dbt for if-else SQL statement

During the creation of a model in dbt, I'm trying to construct an if-else statement with the following logic: if there is a table named "table_name" under "project_name.dataset", then use SELECT 1, else use SELECT 2.
As I understand it, this should be something like this:
{% if "table_name" in run_query("
SELECT
table_name
FROM project-name.dataset.INFORMATION_SCHEMA.TABLES
").columns[0].values() %}
SELECT
1
{% else %}
SELECT
2
{% endif %}
This all happens in BigQuery, by the way; that's why we use project-name.dataset.INFORMATION_SCHEMA.TABLES to extract the names of all the tables under this project and dataset.
But unfortunately this approach doesn't work. It would be really great if somebody could help me, please.
Here is how I did it:
{% set tables_list = [] %}
{%- for row in run_query(
"
SELECT
*
FROM project-name.dataset_name.INFORMATION_SCHEMA.TABLES
"
) -%}
{%- do tables_list.append(row.values()[2]) -%}
{%- endfor -%}
{% if "table_name" in tables_list %}
SELECT logic 1
{% else %}
SELECT logic 2
{% endif %}
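For what it's worth, the original attempt was close: run_query doesn't return results at parse time, so it has to be gated behind the execute flag, and a hyphenated project name needs backticks in BigQuery. A sketch of that more direct route:
{% set query %}
SELECT table_name
FROM `project-name.dataset.INFORMATION_SCHEMA.TABLES`
{% endset %}
{% if execute %}
{% set tables_list = run_query(query).columns[0].values() %}
{% else %}
{# parse time: run_query is unavailable, fall back to an empty list #}
{% set tables_list = [] %}
{% endif %}
{% if "table_name" in tables_list %}
SELECT 1
{% else %}
SELECT 2
{% endif %}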

Retrieving table name from snowflake information_schema using dbt

I have created a macro that returns a table name from the INFORMATION_SCHEMA in Snowflake.
I have tables in Snowflake as follows:
------------
| TABLES |
------------
| ~one |
| ~two |
| ~three |
------------
I want to pass the table type, i.e. one, into the macro and get back the actual table name, i.e. ~one.
Here is my macro (get_table.sql) in dbt that takes in the parameter and returns the table name:
{%- macro get_table(table_type) -%}
{%- set table_result -%}
select distinct TABLE_NAME from "DEMO_DB"."INFORMATION_SCHEMA"."TABLES" where TABLE_NAME like '\~%{{table_type}}%'
{%- endset -%}
{%- set table_name = run_query(table_result).columns[0].values() -%}
{{ return(table_name) }}
{%- endmacro -%}
Here is my dbt model that calls the above macro:
{{ config(materialized='table',full_refresh=true) }}
select * from {{get_table("one")}}
But I am getting an error:
Compilation Error in model
'None' has no attribute 'table'
> in macro get_table (macros\get_table.sql)
I don't understand where the error is.
You need to use the execute context variable to prevent this error, as described here:
https://discourse.getdbt.com/t/help-with-call-statement-error-none-has-no-attribute-table/602
You should also be careful with your query: the table names are uppercase, so you'd better use ilike instead of like.
Another important point: run_query(table_query).columns[0].values() returns an array, so I added an index to the end.
So here's the modified version of your macro, which I successfully ran in my test environment:
{% macro get_table(table_name) %}
{% set table_query %}
select distinct TABLE_NAME from "DEMO_DB"."INFORMATION_SCHEMA"."TABLES" where TABLE_NAME ilike '%{{ table_name }}%'
{% endset %}
{% if execute %}
{%- set result = run_query(table_query).columns[0].values()[0] -%}
{{return( result )}}
{% else %}
{{return( false ) }}
{% endif %}
{% endmacro %}
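With the execute guard in place, the model from the question should then compile at run time to something like:
select * from ~one
(given the table listing above, since ilike '%one%' matches only ~one).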