dbt: sql_header macro limitations vs query-comment - dbt

dbt has a configuration setting, sql_header, that is ostensibly for injecting UDFs at runtime into a model statement. Unfortunately, it seems calling a macro is unsupported. In addition, ephemeral materializations are un-impacted by this setting. I created a setting called sql_footer that does the same at the end of a SQL statement, and it has similar limitations.
Would it be reasonable to tweak the query_header code to support injecting raw SQL in addition to comment blocks, say by adding an execution boolean to the config dictionary?
dbt/core/dbt/adapters/base/query_headers.py
def add(self, sql: str) -> str:
    if not self.query_comment:
        return sql
    if self.append:
        # replace last ';' with '<comment>;'
        sql = sql.rstrip()
        if sql[-1] == ';':
            sql = sql[:-1]
            return '{}\n{} {} {};'.format(sql, block_start, self.query_comment.strip(), block_end)
vs
            return '{}\n/* {} */;'.format(sql, self.query_comment.strip())
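To make the proposal concrete, here is a minimal standalone sketch of the behavior I have in mind. It is not dbt code: `append_to_sql` and the `executable` flag are hypothetical names, and the function only mimics the append branch shown above.

```python
def append_to_sql(sql: str, text: str, executable: bool = False) -> str:
    """Append `text` to `sql`, as a comment by default or as raw SQL.

    `executable` is a hypothetical flag, not an existing dbt setting.
    """
    text = text.strip()
    if not text:
        return sql
    body = sql.rstrip()
    # Remember a trailing ';' so it can be restored after the appended text.
    had_semicolon = body.endswith(";")
    if had_semicolon:
        body = body[:-1].rstrip()
    # Raw SQL is appended verbatim; otherwise it is wrapped in a block comment.
    suffix = text if executable else "/* {} */".format(text)
    return "{}\n{}{}".format(body, suffix, ";" if had_semicolon else "")
```

With `executable=True` the appended text becomes part of the statement, so it would have to be trusted, system-level input.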
I understand any reticence about injecting SQL into SQL. My use cases are very much system-level configurations that a model developer would never come into contact with and that would ideally be controlled through CI/CD. Our ETL has different implementations that require different staging filters depending on the environment. I'd prefer to inject a line or two of SQL rather than duplicate models for each implementation.
For example:
dbt_project.yml
models:
  - foo:
      query_comment:
        comment: "{{ var('ops_filter', default_filter()) }}"
        executable: True
        append: True
stg_foo.sql
with source as (Select *
from {{ source('foo') }})
select id
from source
### inject footer sql here ###
where $date_param between dbt_valid_to and dbt_valid_from
|where 1=1
|where dms_updated_at::date=$date_param
Any advice is appreciated, love this project!

Based on your use case, it sounds like you're interested in functionality along the lines of this older issue:
https://github.com/fishtown-analytics/dbt/issues/1096. We closed that issue in May due to lack of interest from the community, but that doesn't mean that people don't run into this problem (and dbtonic answers for it) today.
As I see it, the best answer is to include a macro {{ footer_sql() }} at the bottom of your models, which could then dynamically include (or not) your environment-specific logic:
{% macro footer_sql(date_param) %}
    {% if target.name == 'ci' %}
        where {{ date_param }} between dbt_valid_to and dbt_valid_from
    {% elif target.name == 'prod' %}
        where 1=1
    {% elif target.name == 'dev' %}
        where dms_updated_at::date = {{ date_param }}
    {% endif %}
{% endmacro %}
Last but not least, I just want to address a few of the things you mentioned:
"Unfortunately, it seems calling a macro is unsupported."
You can absolutely include Jinja macros in set_sql_header calls, as long as those macros compile to SQL. This is how many users create UDFs on BigQuery.
"In addition, ephemeral materializations are un-impacted by this setting."
That's correct. The purpose of SQL headers is to interpolate SQL that will precede the create view as/create table as DDL; since ephemeral models aren't materialized as database objects, they have no DDL to precede.
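As a rough illustration of that last point, here is a simplified Python sketch. It is not how dbt is implemented: `build_statement` is a hypothetical helper that only shows where a header could go for a view or table, and why an ephemeral model, which is inlined as a CTE, has nowhere to put one.

```python
def build_statement(name, select_sql, materialized, sql_header=""):
    """Return the text that a model contributes (simplified illustration)."""
    if materialized == "ephemeral":
        # No DDL is issued: the model only exists as a CTE inside
        # downstream queries, so there is no statement for a header to precede.
        return "{} as (\n{}\n)".format(name, select_sql)
    ddl = "create {} {} as (\n{}\n)".format(materialized, name, select_sql)
    header = sql_header.strip()
    # The header, if any, is emitted immediately before the DDL.
    return "{}\n{}".format(header, ddl) if header else ddl
```

Calling it with `materialized="view"` and a header yields the header followed by the create view statement, while `materialized="ephemeral"` ignores the header entirely.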

Related

How can I reference a table in dbt using its alias and a var, not its resource name?

I have been able to create a reasonably complex dbt project containing several models, all of which rely on a single model that acts as a filter.
Broadly, the numerous models follow the pattern:
{{ config(materialized = 'view') }}
SELECT
*
FROM
TABLE
INNER JOIN
{{ ref('filter_table') }} FILTER
ON
TABLE.KEY = FILTER.KEY
The filter table (let's imagine it's called filter_table.sql) is simply:
{{ config(materialized = 'view') }}
SELECT
*
FROM
FILTER_SOURCE
WHERE
RELEVANT = True
This works fine when I reference it in the numerous models like this: {{ ref('filter_table') }}.
However, when I try to use an alias in the filter table it seems that the alias is not resolved in time for dbt to be able to recognise it.
I amend the config of filter_table.sql to this...
{{ config(materialized = 'view', alias = 'FILT') }}
...and the references in the dependent models like this...
{{ ref(var('filter_table_alias')) }}
...with a var in dbt_project.yml set like this:
vars:
  filter_table_alias: 'FILT'
However, I get a message stating that the node named 'FILT' is not found.
So my working theory is that, although dbt recognises the dependencies based on how the refs are set up, it is not able to do this using an alias; presumably the alias has not been processed by the time dbt sets up the graph.
Is there a quick way to set up the alias and force it to be loaded first?
Or am I barking up the wrong tree?
The alias only impacts the name of the relation where the model is materialized in your database. ref always takes a model name, not an alias.
So you can add an alias = 'FILT' config to your filter table if you want, but in the other models you must continue to ref('filter_table').
The reason for this distinction is that dbt model names must be unique (within a dbt package/project), but aliases need not be unique (if they are materialized to different schemas).
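A toy Python sketch of that distinction, under the assumption that resolution is a simple lookup keyed by model name (the `models` dictionary, the `analytics` schema, and this `ref` function are hypothetical stand-ins, not dbt internals):

```python
# model name -> (schema, alias or None); names are unique keys, aliases are not.
models = {
    "filter_table": ("analytics", "FILT"),
    "orders": ("analytics", None),
}


def ref(model_name: str) -> str:
    """Resolve a model name to the relation it is materialized as."""
    if model_name not in models:
        raise KeyError("node named '{}' not found".format(model_name))
    schema, alias = models[model_name]
    # The alias only changes the identifier used in the database.
    return "{}.{}".format(schema, alias or model_name)
```

Here `ref("filter_table")` gives `analytics.FILT`, while `ref("FILT")` fails because no model has that name.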
You might be able to take advantage of dbt classes: check out api.Relation, in which I believe the identifier could be set to the alias.

dbt post hook relation "my_table" does not exist

I am building some models using dbt.
I have a model like so:
SELECT
COALESCE(
col1, col2
) AS col,
....
FROM
{{ source(
'db',
'tbl'
) }}
WHERE ....
This model has a config section calling a macro:
{{- config(
post_hook = [macro()],
materialized='table'
) -}}
Within the macro I use {% if execute %}, and I also log the execute value to check it: {{ log('Calling update macro with exec value = ' ~ execute) }}
When I run dbt compile, I do not expect the macro to fire, according to the documentation. However, it does fire, and execute is actually set to true, which triggers the update and causes an error because the table doesn't exist. Am I missing something, or is this a dbt bug? I am confused!
Here's the line from the logs -
2021-09-15 20:48:16.864555 (Thread-1): Calling update macro with exec value = True
...and the error is
relation "schema.my_table" does not exist
I'd appreciate any pointers someone might have. Thanks!
OK, so here's what I found out about dbt.
When you run dbt compile or dbt run for the first time, the tables do not exist in the database yet. However, both compile and run check that the database objects exist and throw an error otherwise. So the select within my macro failed regardless of my use of {% if execute %}.
I called adapter.get_relation() to check whether the table exists:
{%- set source_relation = adapter.get_relation(
    database=this.database,
    schema=this.schema,
    identifier=this.name) -%}
and used this check condition:
{% set table_exists=source_relation is not none %}
For an incremental run, the fix was easier:
{% if execute and is_incremental() %}
Now, my code is fixed :)
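The guard boils down to two checks: only act in execute mode, and only if the relation already exists. A minimal Python sketch of that logic follows. `update_statement`, `existing_relations`, and the `processed` column are hypothetical stand-ins, not dbt APIs; in dbt the lookup is done by adapter.get_relation().

```python
def update_statement(execute, existing_relations, database, schema, identifier):
    """Return the SQL for the update hook, or '' when it must be skipped."""
    if not execute:
        # Parse phase: nothing should be run against the database.
        return ""
    if (database, schema, identifier) not in existing_relations:
        # First run: the table has not been created yet, so skip the update.
        return ""
    # Hypothetical update; the real statement depends on the model.
    return "update {}.{}.{} set processed = true".format(database, schema, identifier)
```

On a first run the set of existing relations is empty, so the function returns an empty string instead of SQL that would fail.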

How to parse a variable as a source reference in dbt?

I am building a model where I am dynamically referencing the table name and schema name based on the results of a query.
{%- set query %}
select * from master_data.event_metadata where game='{{game_name}}'
{% endset -%}
{%- set results = run_query(query) -%}
{%- if execute %}
{%- set record = results.rows[0] -%}
{% else %}
{%- set record = [] -%}
{% endif -%}
Two of the values are in record.SCHEMA_NAME and record.TABLE_NAME. I can use something like
select
*
from
{{record.SCHEMA_NAME}}.{{record.TABLE_NAME}}
but I'd rather use the source() function instead so that my documentation and DAG will be clean. How can I pass record.SCHEMA_NAME and record.TABLE_NAME as string arguments? I need to have something like
select
*
from
{{ source(record.SCHEMA_NAME, record.TABLE_NAME) }}
When I try to run the above I get the below error:
Server error: Compilation Error in rpc request (from remote system)
The source name (first) argument to source() must be a string, got <class 'jinja2.runtime.Undefined'>
You might already have found a workaround or a solution for this, but just in case someone else ends up in the same situation...
To convert the values to strings you can use the |string filter. For instance:
record.SCHEMA_NAME|string
record.TABLE_NAME|string
So your query would look something like this:
select * from {{ source(record.SCHEMA_NAME|string|lower, record.TABLE_NAME|string|lower) }}
Note that, depending on the output of your query and how you defined the source file, you might have to lowercase or uppercase your values to match your source.
Problem
Your record variable is the result of an execution (run_query(query)). When you run dbt compile or dbt run, dbt first performs a series of operations: it reads all the files in your project, generates a "manifest.json" file, and uses the ref/source calls to build the DAG. At this point no SQL is executed; in other words, execute == False.
In your example, even if you use record.SCHEMA_NAME|string, you will not be able to retrieve the value of that variable because nothing has been executed. And since you set record = [] when execute is false, you will get the message "... depends on a source named '.' which was not found", because at that point record is empty.
A workaround would be to wrap your model's query in an if execute block, something like this:
{% if execute %}
select * from {{ source(record.SCHEMA_NAME|string|lower, record.TABLE_NAME|string|lower) }}
{% endif %}
With that approach, you will be able to set the source of your model dynamically.
Unfortunately, this will not work as you expect, because wrapping your model's query in an if execute block prevents dbt from generating the DAG for that model.
In the end, this would be the same as your first attempt, with the schema and table declared without the source function.
For more details, you can check the dbt documentation about the execute mode:
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
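The two phases can be mimicked in a few lines of plain Python. This is only an illustration of the reasoning above, not dbt code: `source` and `render_guarded` are hypothetical stand-ins, and the `events`/`clicks` names are made up.

```python
def source(schema, table, dag_edges):
    """Stand-in for source(): validate the arguments and record a DAG edge."""
    if not isinstance(schema, str) or not isinstance(table, str):
        raise TypeError("source() arguments must be strings")
    dag_edges.append((schema, table))
    return "{}.{}".format(schema, table)


def render_guarded(execute, record, dag_edges):
    """Model body wrapped in `if execute`: renders nothing at parse time."""
    if not execute:
        return ""
    return "select * from " + source(record["SCHEMA_NAME"], record["TABLE_NAME"], dag_edges)


# Parse phase: no query has run, so the record is empty and no edge is recorded.
parse_edges = []
parse_sql = render_guarded(False, {}, parse_edges)

# Execute phase: the query result is available and the SQL renders.
run_edges = []
run_sql = render_guarded(True, {"SCHEMA_NAME": "events", "TABLE_NAME": "clicks"}, run_edges)
```

Because the DAG is built from the parse phase, the empty `parse_edges` list is the reason the model ends up with no source dependency.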
I think you need to convert those two objects into their string representations before passing them to the source macro.
Try this:
select
*
from
{{ source(record.SCHEMA_NAME|string, record.TABLE_NAME|string) }}

DBT 'dbt snapshot' command resulting in error: "Database Error in snapshot snapshot_name Unrecognized name: id at [53:13]"

As the question says, I am running the dbt snapshot command and a few of my snapshots are not working because dbt is not recognizing the surrogate key id that I created. My snapshots are all built the same way, and so are the base views they are based on. Here is an example of a snapshot that is not working because it does not recognize the surrogate key:
{% snapshot example_snapshot %}
{{ config(
target_schema = 'snapshots',
unique_key = 'id',
strategy = 'check',
check_cols = 'all'
) }}
SELECT
*
FROM
{{ ref('base_example') }}
{% endsnapshot %}
followed by an example of the base view it is referencing:
WITH src AS (
SELECT
*
FROM
{{ source(
'tomato',
'potato'
) }}
),
cleaned AS (
SELECT
*,
{{ dbt_utils.surrogate_key(['column', 'another_column', 'yet_another_column']) }} AS id
FROM
src
)
SELECT
*
FROM
cleaned
Keep in mind that when I run the command dbt run -m [base_example] it produces a view where I can see the hash generated as a surrogate key. The issue is only when I run dbt snapshot. In fact, running dbt snapshot --select [example_snapshot] to only run one snapshot at a time doesn't give me any errors for any of the snapshots. The most confusing part: I have one base view and snapshot of that base view configured exactly as the other 3 that are not working, yet it recognizes the surrogate key when creating a snapshot. I'm seriously stumped, any help would be appreciated.
In my experience, snapshots can get a little wonky when they depend on a dbt model (via ref('base_example')) rather than a source. It's advisable to select from the source, though the docs don't explain well why.
Since your transformation only adds a surrogate key based on three columns in the source, I wonder if you could just put the transformation in the unique_key parameter, à la (thinking in Redshift land here, and without testing):
{% snapshot example_snapshot %}
{{ config(
target_schema = 'snapshots',
unique_key = 'md5(column || another_column || yet_another_column)',
strategy = 'check',
check_cols = 'all'
) }}
SELECT
*
FROM
{{ source('tomato', 'potato') }}
{% endsnapshot %}
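The idea of a surrogate key as a single hash over several columns can be sketched in Python, with hashlib standing in for the warehouse's md5 function. The '-' separator and the treatment of NULL as an empty string are assumptions made for this illustration, not a description of what dbt_utils.surrogate_key generates.

```python
import hashlib


def surrogate_key(*values):
    """Hash several column values into one key (illustrative only)."""
    # Assumption: NULLs become empty strings and '-' separates the parts.
    parts = ["" if value is None else str(value) for value in values]
    # md5 takes one input, so the parts are joined into a single string first.
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()
```

The same inputs always give the same 32-character key, and changing the column order changes the key.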

How to avoid repeating myself in Salt states?

We have two different environments, dev and production, managed by a single Salt server.
Something like this:
base:
  'dev-*':
    - users-dev
  'prod-*':
    - users-prod
The users-dev and users-prod states are pretty much the same, like this:
{% for user, data in pillar['users-dev'].items() %}
{{ user }}-user:
  user.present:
    < ...something... >
{{ user }}_ssh_auth:
  ssh_auth.present:
    < ...something... >
{% endfor %}
We did not want to duplicate the code so our initial idea was to do something like this:
{% users = pillar['users'].items() %}
include:
  - users-common
and then refer to users in users-common, but this did not work: the proper Jinja syntax is {% set users = pillar['users'].items() %}, and even then the variable is not intended to carry across Salt state includes.
So, the question is how to do it properly?
All Jinja is evaluated before any of the states (including the include statements) are evaluated.
However, I would think you would just be able to refer directly to pillar['users'].items() inside of users-common. Is it not allowing you to access pillar from within that state?