dbt source definition with dbt_utils get_relations_by_pattern and union_relations

I am creating unioned models of multiple tables that are split by year in our source (e.g. TABLENAME2020, TABLENAME2021, etc.) with the following code:
{% set relations = dbt_utils.get_relations_by_pattern(
    database='DBNAME',
    schema_pattern='SCHEMA',
    table_pattern='TABLENAME%'
) %}

with source as (
    select * from (
        {{ dbt_utils.union_relations(relations=relations) }}
    )
)
select * from source
This works fine; however, I am wondering whether there is a way to define the source in my sources.yml file so that it is correctly represented in the dbt-generated documentation.
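For what it's worth, dbt sources can't be declared with a wildcard pattern, so each yearly table has to be listed explicitly in sources.yml for it to appear in the generated documentation. A minimal sketch (the source name my_source is made up; the rest mirrors the example above):

sources:
  - name: my_source
    database: DBNAME
    schema: SCHEMA
    tables:
      - name: TABLENAME2020
      - name: TABLENAME2021

The model would then need to reference the tables via source() for the docs to draw the lineage.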

Related

How to access BigQuery table metadata in DBT using jinja?

I'd like to access the last modified time column from the metadata of a BigQuery table that acts as a source. I want to create a generic test that checks if the last modified date of the source table is equal to today.
In BigQuery you can access this data in this way:
SELECT
    last_modified_time
FROM `project.dataset.__TABLES__`
WHERE table_id = 'table_id'
My goal is to make the project.dataset dynamic depending on the model this test is applied to. Similarly, I'd like for table_id to be dynamic.
Given that dbt's documentation says a BigQuery dataset is the equivalent of a 'schema', I tried this, but it didn't work:
{% test last_modified_time(schema, model) %}
SELECT
    last_modified_time
FROM `{{ database }}.{{ schema }}.__TABLES__`
WHERE table_id = {{ model }}
{% endtest %}
What this does is render the project name for both database and schema.
Also, model will (of course) render the full project.dataset.table_id path, while I only need the table_id.
I'm fairly new to dbt, but I couldn't find anything that resembles what I'm looking for.
I tinkered with your solution for a little bit and this works flawlessly.
Thank you so much!
{% test last_modified_time(model) %}
WITH t AS (
    SELECT DATE(TIMESTAMP_MILLIS(last_modified_time)) AS lmt
    FROM `{{ model.database }}.{{ model.schema }}.__TABLES__`
    WHERE table_id = '{{ model.identifier }}'
)
SELECT
    lmt
FROM t
WHERE lmt < CURRENT_DATE()
{% endtest %}
There are a few changes you need to make:
Your generic test accepts an argument named schema, which dbt won't provide when you execute the test. The test should accept only model, and then you'll want to configure your yaml file so the test is on the model (not on a column):
models:
  - name: my_model
    tests:
      - last_modified_time
The model argument is a Relation, and you can use that to grab the database/project, schema/dataset, and identifier (the materialized name) of the model.
Tests fail if they return any records, so your test as written will always fail. You need to compare the last_modified_time to the current date and only return records that are older than the current date.
Putting that all together:
{% test last_modified_time(model) %}
with t as (
    -- last_modified_time in __TABLES__ is milliseconds since the epoch,
    -- so convert it to a DATE before comparing it to current_date()
    SELECT DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
    FROM `{{ model.database }}.{{ model.schema }}.__TABLES__`
    WHERE table_id = '{{ model.identifier }}'
)
select *
from t
where t.last_modified_date < current_date()
{% endtest %}
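With that yaml config in place, the test runs with the rest of your suite, or on its own with the standard dbt CLI:

dbt test -m my_model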

DBT - Use dynamic array for Pivot

I would like to use DBT to pivot a column in my BigQuery table.
Since I have more than 100 values, I want the pivoted columns to be generated dynamically. I would like something like this:
select *
from (
    select ceiu.value, ceiu.user_id, ce.name as name
    from company_entity_item_user ceiu
    left join company_entity ce on ce.id = ceiu.company_entity_id
)
PIVOT(STRING_AGG(value) FOR name IN (select distinct name from company_entity))
The problem here is that I can't use a SELECT statement inside IN.
I know I can use Jinja templates with dbt; it could look like this:
...
PIVOT(STRING_AGG(value) FOR name IN ('{{unique_company_entities}}'))
...
But I have no idea how to use a SELECT statement to create such a variable.
Also, since I am using BigQuery, I tried using DECLARE and SET, but I don't know how to use them in dbt, if that's even possible.
Thanks for your help.
Elevating the comment by @aleix-cc to an answer, because that is the best way to do this in dbt.
To get data from your database into the jinja context, you can use dbt's built-in run_query macro. Or if you don't mind using the dbt-utils package, you can use the get_column_values macro from that package, which will return a list of the distinct values in that column (it also guards the call to run_query with an {% if execute %} statement, which is critical to preventing dbt compilation errors).
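(As an aside, a bare run_query version of that lookup, with the execute guard, might look like this sketch:)

{% set company_list = [] %}
{% if execute %}
    {% set results = run_query("select distinct name from " ~ ref('company_entity')) %}
    {% set company_list = results.columns[0].values() %}
{% endif %}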
Assuming company_entity is already a dbt model, your model becomes:
{% set company_list = dbt_utils.get_column_values(table=ref('company_entity'), column='name') %}

{# company_list is a jinja list of strings. We need a comma-separated
   list of single-quoted string literals #}
{% set company_csv = "'" ~ company_list | join("', '") ~ "'" %}

select *
from (
    select ceiu.value, ceiu.user_id, ce.name as name
    from company_entity_item_user ceiu
    left join company_entity ce on ce.id = ceiu.company_entity_id
)
PIVOT(STRING_AGG(value) FOR name IN ({{ company_csv }}))
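To make the rendering concrete: if company_entity held the names acme and globex (made-up values), the compiled SQL would come out as:

select *
from (
    select ceiu.value, ceiu.user_id, ce.name as name
    from company_entity_item_user ceiu
    left join company_entity ce on ce.id = ceiu.company_entity_id
)
PIVOT(STRING_AGG(value) FOR name IN ('acme', 'globex'))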

DBT macro for repetitive task

I am a beginner in dbt. I have a requirement where I have created an incremental model like the one below. I need to execute the same incremental model logic for different systems.
There are 3 variables or parameters that I need to pass, i.e. for each run the ATTRIBUTE_NAME, VIEW_NAME, and SYSTEM_NAME need to be passed. For the next run, all 3 parameters will be different.
However, for a particular SYSTEM_NAME, the VIEW_NAME and ATTRIBUTE_NAME are fixed.
Please help me execute the dbt run using a macro for this requirement, passing the different system names and their corresponding view names and attribute names. The objective is to use a single dbt run statement that executes this model for every ATTRIBUTE_NAME, VIEW_NAME, and SYSTEM_NAME combination.
For now, I have defined variables and execute a separate run for each system from the CLI, like this:
e.g.
dbt run --vars '{"VIEW_NAME": CCC, "SYSTEM_NAME": BBBB, "ATTRIBUTE_NAME": AAAA}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": DDD, "SYSTEM_NAME": FFF, "ATTRIBUTE_NAME": HHH}' -m incremental_modelname
dbt run --vars '{"VIEW_NAME": EEE, "SYSTEM_NAME": GGG, "ATTRIBUTE_NAME": III}' -m incremental_modelname
Re-usable incremental model:
{{
    config(
        materialized='incremental',
        transient=false,
        unique_key='composite_key',
        post_hook="insert into table (col1, col2, col3)
            select
                '{{ var('ATTRIBUTE_NAME') }}',
                col2,
                col3
            from {{ this }} a
            join table b on a=b
            where b.SYSTEM_NAME='{{ var('SYSTEM_NAME') }}';
            commit;"
    )
}}
with name1 as (
    select
        *
    from {{ var('VIEW_NAME') }}
)
select
    *
from name1
{% if is_incremental() %}
where (select timestamp_column from {{ var('VIEW_NAME') }}) >
    (select max(timestamp_column) from {{ this }} where SYSTEM_NAME='{{ var("SYSTEM_NAME") }}')
{% endif %}
The easiest way would be to:
1. Create a model (or even a seed) that holds the system name, view name, and attribute name.
2. Within your code, add a for loop, as in this skeleton:
{% set query %}
    select system_name, view_name, attribute_name from model_name
{% endset %}

{% set results = run_query(query) %}

{% for row in results.rows %}
    /*
    Put your query here, but reference the columns you need:
    row[0] = system_name
    row[1] = view_name
    row[2] = attribute_name
    */
{% endfor %}
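Putting that together, here is a sketch of the full model. The seed name system_config and the column timestamp_column are illustrative, and run_query only returns data at execution time, hence the execute guard:

{{ config(materialized='incremental', unique_key='composite_key') }}

{% if execute %}
    {% set rows = run_query('select system_name, view_name, attribute_name from ' ~ ref('system_config')).rows %}
{% else %}
    {% set rows = [] %}
{% endif %}

{% for row in rows %}
select
    '{{ row[0] }}' as system_name,
    *
from {{ row[1] }}
{% if is_incremental() %}
where timestamp_column >
    (select max(timestamp_column) from {{ this }} where system_name = '{{ row[0] }}')
{% endif %}
{% if not loop.last %}union all{% endif %}
{% endfor %}

A single dbt run -m incremental_modelname then covers every system in system_config.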

Assign the value of a column to a variable in SQL using the Jinja template language

I have a SQL file like this, to transform a table that has a column containing a JSON string:
{{ config(materialized='table') }}
with customer_orders as (
    select
        time,
        data as jsonData,
        {% set my_dict = fromjson( jsonData ) %}
        {% do log("Printout: " ~ my_dict, info=true) %}
    from `warehouses.raw_data.customer_orders`
    limit 5
)
select *
from customer_orders
When I run dbt run, it returns this:
Running with dbt=0.21.0
Encountered an error:
the JSON object must be str, bytes or bytearray, not Undefined
I can't even print out the value of the column I want:
{{ config(materialized='table') }}
with customer_orders as (
    select
        time,
        tag,
        data as jsonData,
        {% do log("Printout: " ~ data, info=true) %}
    from `warehouses.raw_data.customer_orders`
    limit 5
)
select *
from customer_orders
22:42:58 | Concurrency: 1 threads (target='dev')
22:42:58 |
Printout:
22:42:58 | Done.
But if I create another model to print out the values of jsonData:
{%- set payment_methods = dbt_utils.get_column_values(
    table=ref('customer_orders_model'),
    column='jsonData'
) -%}
{% do log(payment_methods, info=true) %}
{% for json in payment_methods %}
    {% set my_dict = fromjson(json) %}
    {% do log(my_dict, info=true) %}
{% endfor %}
It prints out the JSON value I want:
Running with dbt=0.21.0
This is the log:
Found 2 models, 0 tests, 0 snapshots, 0 analyses, 372 macros, 0 operations, 0 seed files, 0 sources, 0 exposures
21:41:15 | Concurrency: 1 threads (target='dev')
21:41:15 |
['{"log": "ok", "path": "/var/log/containers/...log", "time": "2021-10-26T08:50:52.412932061Z", "offset": 527, "stream": "stdout", "#timestamp": 1635238252.412932}']
{'log': 'ok', 'path': '/var/log/containers/...log', 'time': '2021-10-26T08:50:52.412932061Z', 'offset': 527, 'stream': 'stdout', '#timestamp': 1635238252.412932}
21:41:21 | Done.
But I want to process this jsonData within a model file like customer_orders_model above.
How can I get the value of a column, assign it to a variable, and then process it however I want (e.g. check whether the JSON has a key I care about and put its value in a new column)?
Note: my purpose is this: my table has a JSON string column, and I want to extract that column into many columns so I can easily write the SQL queries I want.
Since you are on BigQuery, Google provides JSON functions in Standard SQL.
If your column is a JSON string, I think you can use JSON_EXTRACT to get the value of the key you want.
For example:
with customer_orders as (
    select
        time,
        tag,
        data as jsonData,
        json_extract(data, '$.log') AS log
    from `dc-warehouses.raw_data.logs_trackfoe_prod`
    limit 5
)
select *
from customer_orders
You are very close! The thing to remember is that dbt and Jinja are primarily for rendering text. Anything that isn't inside curly braces is just a text string.
So in your first example, data and jsonData are substrings of the larger query (which is also a string), so they aren't variables that Jinja knows about, which explains the error message saying they are Undefined:
with customer_orders as (
    select
        time,
        data as jsonData,
    from `warehouses.raw_data.customer_orders`
    limit 5
)
select *
from customer_orders
This is why dbt_utils.get_column_values() works for you: that macro actually runs a query to get the data, and you're assigning the result to a variable. The run_query macro can be helpful for situations like this (and I'm fairly certain get_column_values uses run_query in the background).
In regard to your original question, where you want to turn a JSON dict into multiple columns, I'd first recommend having your database do this directly. Many databases have functions that let you do this; Jinja is primarily for generating SQL queries dynamically, not for manipulating data. Even if you could load all the JSON into Jinja, I don't know how you'd write it back into a table without using something like an INSERT INTO ... VALUES statement, which, IMHO, goes against the design principles of dbt.
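To illustrate that in dbt terms: a model along these lines (using BigQuery's JSON_EXTRACT_SCALAR; the keys are taken from the logged JSON above) extracts the JSON keys into real columns once, so downstream queries never have to touch the raw string:

{{ config(materialized='table') }}

select
    time,
    tag,
    json_extract_scalar(data, '$.log') as log,
    json_extract_scalar(data, '$.path') as path,
    json_extract_scalar(data, '$.stream') as stream
from `warehouses.raw_data.customer_orders`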

VFP. Re-creating indexes

I would like to be able to re-create the index to a table which is part of a database.
The index to the table is corrupted, so the table cannot be opened (certainly not without an error message). I would like to use the information in the database container to reconstruct the index (with its several tags)
Perhaps I could offer the user a facility to browse a list of tables in the database and then choose one table (or maybe all) to be re-indexed or packed. I have looked at the DBGETPROP() and CURSORGETPROP() function calls, but have not found options that give me the information I need.
As always, grateful for guidance.
Thanks Tamar. Yes, I had been opening the .dbc, but then wanted the tag names and index expressions.
What I would like to do is build up a cursor with details of the tag names and index expressions for each table in the database.
As far as I can see, I need to get the names of the tables from the .dbc and then open each table to find its indexes. I think these are only in the tables themselves, not in the .dbc. I can see that the SYS(14) function will give me the index expressions and that TAG() will give me the tag names. Is that the way to go, or is there some function already available that will do these things for me?
The database container is a table, so you can open it with USE to read the data there.
You shouldn't rely on the DBC to find the index information and rebuild from it. It doesn't hold the index information anyway (except for a few flags, such as whether a field is a primary key or is indexed, which are not of much value). Sometimes, opening the database exclusively (with exclusive access to its tables) and then doing a:
validate database recover
helps, but you can't rely on that either. To correctly recreate the indexes, you have to delete them all (i.e. delete the CDX files) and then recreate them from scratch.
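For a single table, the recreate step itself is short. A minimal sketch with made-up table, field, and tag names (the expressions would come from whatever you saved via SYS(14) and TAG() before the CDX was erased):

USE mytable EXCLUSIVE
INDEX ON customer_id TAG cust_id            && plain tag
INDEX ON UPPER(last_name) TAG name_upr      && expression tag
INDEX ON order_id TAG order_id CANDIDATE    && candidate-key tag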
If it helps, below is the utility code I have written and use for "repair" in case it's needed. It creates a kind of data dictionary for both DBC and free tables:
* CreateDictionary.prg
* Author: Cetin Basoz
* CreateDictionary('c:\mypath\v210\data','','zipcodes,states')
* Creates DataDictionary files in 'DataDic' directory (Default if not specified)
* using 'c:\mypath\v210\data' dir as source data dir
* adds zipcodes and states.dbf as static files (with data as is)
* tcUserStatic - tables that should be kept on user as is
Lparameters tcDataDir, tcDicDir, tcStaticList, tcUserStaticList
tcDataDir = iif(type('tcDataDir')='C' and directory(addbs(m.tcDataDir)), addbs(m.tcDataDir), sys(5)+curdir())
tcDicDir = iif(type('tcDicDir')='C' and !empty(m.tcDicDir), addbs(m.tcDicDir), 'DataDic\')
tcStaticList = iif(Type('tcStaticList')='C',trim(m.tcStaticList),'')
If !directory(justpath(m.tcDicDir))
Md (justpath(m.tcDicDir))
Endif
Close data all
lnDatabases = adir(arrDBC,m.tcDataDir+'*.dbc')
Create table (m.tcDicDir+'DBCreator') (DBCName M nocptrans, FileBin M nocptrans, Filename c(128) nocptrans)
for ix = 1 to m.lnDatabases
open data (m.tcDataDir+arrDBC[m.ix,1])
do home()+'Tools\Gendbc\gendbc' with forceext(m.tcDicDir+arrDBC[m.ix,1],'PRG')
compile (forceext(m.tcDicDir+arrDBC[m.ix,1],'PRG'))
insert into (m.tcDicDir+'DBCreator') ;
values (arrDBC[m.ix,1], ;
FileToStr(forceext(m.tcDicDir+arrDBC[m.ix,1],'FXP')), ;
forceext(arrDBC[m.ix,1],'FXP'))
erase (forceext(m.tcDicDir+arrDBC[m.ix,1],'PRG'))
erase (forceext(m.tcDicDir+arrDBC[m.ix,1],'FXP'))
if file(forceext(m.tcDicDir+arrDBC[m.ix,1],'KRT'))
insert into (m.tcDicDir+'DBCreator') ;
values (arrDBC[m.ix,1], ;
FileToStr(forceext(m.tcDicDir+arrDBC[m.ix,1],'KRT')),;
forceext(arrDBC[m.ix,1],'KRT'))
erase (forceext(m.tcDicDir+arrDBC[m.ix,1],'KRT'))
endif
endfor
Close data all
Create cursor crsSTRUCTS ;
(FIELD_NAME C(128) nocptrans, ;
FIELD_TYPE C(1), ;
FIELD_LEN N(3, 0), ;
FIELD_DEC N(3, 0), ;
FIELD_NULL L, ;
FIELD_NOCP L, ;
_TABLENAME M nocptrans)
Create cursor crsINDEXES ;
(TAG_NAME C(10) nocptrans, ;
KEY_EXPR M, ;
NDXTYPE C(1), ;
IS_DESC L, ;
FILTEREXPR M nocptrans, ;
_TABLENAME M nocptrans)
Select 0
lnTables = adir(arrTables,m.tcDataDir+'*.dbf')
For ix=1 to m.lnTables
Use (m.tcDataDir+arrTables[m.ix,1])
if empty(cursorgetprop('Database'))
lnFields=afields(arrStruc)
For jx=1 to m.lnFields
arrStruc[m.jx,7]=arrTables[m.ix,1]
Endfor
Insert into crsSTRUCTS from array arrStruc
Release arrStruc
If tagcount()>0
Dimension arrIndexes[tagcount(),6]
For jx=1 to tagcount()
arrIndexes[m.jx,1] = tag(m.jx)
arrIndexes[m.jx,2] = key(m.jx)
arrIndexes[m.jx,3] = iif(Primary(m.jx),'P',iif(Candidate(m.jx),'C',iif(unique(m.jx),'U','R')))
arrIndexes[m.jx,4] = descending(m.jx)
arrIndexes[m.jx,5] = sys(2021,m.jx)
arrIndexes[m.jx,6] = arrTables[m.ix,1]
Endfor
Insert into crsINDEXES from array arrIndexes
Endif
endif
Use
Endfor
Select crsSTRUCTS
Copy to (m.tcDicDir+'NewStruc')
Select crsINDEXES
Copy to (m.tcDicDir+'NewIndexes')
Create table (m.tcDicDir+'static') (FileName M nocptrans, FileBin M nocptrans)
If !empty(m.tcStaticList)
lnStatic = alines(arrStatic,chrtran(m.tcStaticList,',',chr(13)))
For ix = 1 to m.lnStatic
lnFiles = adir(arrFiles,m.tcDataDir+trim(arrStatic[m.ix])+'.*')
For jx=1 to m.lnFiles
If inlist(justext(arrFiles[m.jx,1]),'DBF','CDX','FPT')
Insert into (m.tcDicDir+'static') values ;
(arrFiles[m.jx,1], FileToStr(m.tcDataDir+arrFiles[m.jx,1]))
Endif
Endfor
Release arrFiles
Endfor
Endif
CREATE TABLE (m.tcDicDir+'userstatic') (FileName c(50))
If !empty(m.tcUserStaticList)
lnUserStatic = alines(arrUserStatic,chrtran(m.tcUserStaticList,',',chr(13)))
For ix = 1 to m.lnUserStatic
lnFiles = adir(arrFiles,m.tcDataDir+trim(arrUserStatic[m.ix])+'.*')
For jx=1 to m.lnFiles
If inlist(justext(arrFiles[m.jx,1]),'DBF','CDX','FPT')
Insert into (m.tcDicDir+'userstatic') values (arrFiles[m.jx,1])
Endif
Endfor
Release arrFiles
Endfor
Endif
close data all
close tables all