DBT - Use dynamic array for Pivot - sql

I would like to use DBT to pivot a column in my BigQuery table.
Since I have more than 100 values, I want the pivoted columns to be dynamic. I would like something like this:
select *
from ( select ceiu.value, ceiu.user_id, ce.name as name
from company_entity_item_user ceiu
left join company_entity ce on ce.id = ceiu.company_entity_id)
PIVOT(STRING_AGG(value) FOR name IN (select distinct name from company_entity))
The problem here is I can't use a SELECT statement inside IN.
I know I can use Jinja templates with DBT, it could look like this:
...
PIVOT(STRING_AGG(value) FOR name IN ('{{unique_company_entities}}'))
...
But I have no idea how to use a SELECT statement to create such a variable.
Also, since I am using BigQuery, I tried using DECLARE and SET, but I don't know how to use them in DBT, if that is even possible.
Thanks for your help

Elevating the comment by @aleix-cc to an answer, because that is the best way to do this in dbt.
To get data from your database into the jinja context, you can use dbt's built-in run_query macro. Or if you don't mind using the dbt-utils package, you can use the get_column_values macro from that package, which will return a list of the distinct values in that column (it also guards the call to run_query with an {% if execute %} statement, which is critical to preventing dbt compilation errors).
Assuming company_entity is already a dbt model, your model becomes:
{% set company_list = dbt_utils.get_column_values(table=ref('company_entity'), column='name') %}
# company_list is a jinja list of strings. We need a comma-separated
# list of single-quoted string literals
{% set company_csv = "'" ~ company_list | join("', '") ~ "'" %}
select *
from ( select ceiu.value, ceiu.user_id, ce.name as name
from company_entity_item_user ceiu
left join company_entity ce on ce.id = ceiu.company_entity_id)
PIVOT(STRING_AGG(value) FOR name IN ({{ company_csv }}))
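To make the string-building step concrete, here is a minimal Python sketch of what the company_csv expression produces. It is not part of dbt: quote_pivot_values is a hypothetical helper name, the example values are made up, and the backslash escaping of single quotes is an assumption about how you would want to handle names that contain a quote in a BigQuery string literal (the Jinja one-liner above does no escaping).

```python
def quote_pivot_values(names):
    """Return a comma-separated list of single-quoted SQL string literals."""
    quoted = []
    for name in names:
        # Escape backslashes first, then single quotes
        # (assumed BigQuery-style backslash escaping).
        escaped = name.replace("\\", "\\\\").replace("'", "\\'")
        quoted.append("'" + escaped + "'")
    return ", ".join(quoted)


company_list = ["Acme", "Globex", "O'Neil Ltd"]  # made-up example values
company_csv = quote_pivot_values(company_list)
print(company_csv)
```

If any of your names can contain quotes, the same escaping would have to be applied in Jinja before joining, for example with a replace filter on each item.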

Related

dbt source definition with dbt_utils get_relations_by_pattern and union_relations

I am creating unioned models of multiple tables that are split by year in our source (ex. TABLENAME2020, TABLENAME2021, etc) with the following code:
{% set relations = dbt_utils.get_relations_by_pattern(
database='DBNAME',
schema_pattern='SCHEMA',
table_pattern='TABLENAME%') %}
with source as (
select * from (
{{ dbt_utils.union_relations(relations=relations) }}
)
)
select * from source
This functions fine, however, I am wondering if there is a way to define the source in my sources.yml file that would correctly represent this in the dbt generated documentation?

How can I execute a custom function in Microsoft Visual FoxPro 9?

Using Microsoft Visual FoxPro 9, I have a custom function, "newid()", inside of the stored procedures for Main:
function newId
parameter thisdbf
regional keynm, newkey, cOldSelect, lDone
keynm=padr(upper(thisdbf),50)
cOldSelect=alias()
lDone=.f.
do while not lDone
select keyvalue from main!idkeys where keyname=keynm into array akey
if _tally=0
insert into main!idkeys (keyname) value (keynm)
loop
endif
newkey=akey+1
update main!idkeys set keyvalue=newkey where keyname=keynm and keyvalue=akey
if _tally=1
lDone=.t.
endif
enddo
if not empty(cOldSelect)
select &cOldSelect
else
select 0
endif
return newkey
This function is used to generate a new ID for records added to the database.
It is called as the default value:
I would like to call this newid() function and retrieve its returned value. When executing SELECT newid("TABLENAME"), the following error is thrown:
Invalid subscript reference
How can I call the newid() function and return the newkey in Visual FoxPro 9?
As an addition to what Stefan Wuebbe said,
You actually had your answer in your previous question here that you forgot to update.
From your previous question, as I understand you are coming from a T-SQL background. While in T-SQL (and in SQL generally) there is:
Select < anyVariableOrFunction >
that returns a single column, single row result, in VFP 'select' like that has another meaning:
Select < aliasName >
aliasName is the alias of a work area (or it could be the number of a work area), and the command is used to change the 'current work area'. It dates from xBase languages like FoxPro (and dBase), before those languages supported ANSI SQL, if I am not wrong. Anyway, in VFP there are two Select commands: this one, and SELECT - SQL, which definitely requires a FROM clause.
VFP has direct access to variables and function calls though, through the use of = operator.
SELECT newid("TABLENAME")
in T-SQL, would be (you are just displaying the result):
? newid("TABLENAME")
To store it in a variable, you would do something like:
local lnId
lnId = newid("TABLENAME")
* do something with m.lnId
* Note the m. prefix, it is a built-in alias for memory variables
After having said all these, as per your code.
It looks like it has been written by a very old FoxPro programmer and I must admit I am seeing it the first time in my life that someone used "REGIONAL" keyword in VFP. It is from FoxPro 2.x days I know but I didn't see anyone use it up until now :) Anyway, that code doesn't seem to be robust enough in a multiuser environment, you might want to change it. VFP ships with a NewId sample code and below is the slightly modified version that I have been using in many locations and proved to be reliable:
Function NewID
Lparameters tcAlias,tnCount
Local lcAlias, lnOldArea, lnOldReprocess, lcTable, lnTagNo, lnNewValue, lnLastValue, lcOldSetDeleted, lnStart
lnOldArea = Select()
lnOldReprocess = Set('REPROCESS')
* Uppercase Alias name
lcAlias = Upper(Iif(Parameters() = 0, Alias(), tcAlias))
* Lock reprocess - try once
Set Reprocess To 1
If !Used("IDS")
Use ids In 0
Endif
* If no entry yet create
If !Seek(lcAlias, "Ids", "tablename")
Insert Into ids (tablename, NextID) Values (lcAlias,0)
Endif
* Lock, increment id, unlock, return nextid value
Do While !Rlock('ids')
* Delay before next lock trial
lnStart = Seconds()
Do While Seconds()-lnStart < 0.01
Enddo
Enddo
lnLastValue = ids.NextID
lnNewValue = m.lnLastValue + Evl(m.tnCount,1)
*Try to query primary key tag for lcAlias
lcTable = Iif( Used(lcAlias),Dbf(lcAlias), Iif(File(lcAlias+'.dbf'),lcAlias,''))
lcTable = Evl(m.lcTable,m.lcAlias)
If !Empty(lcTable)
Use (lcTable) In 0 Again Alias '_GetPKKey_'
For m.lnTagNo=1 To Tagcount('','_GetPKKey_')
If Primary(m.lnTagNo,'_GetPKKey_')
m.lcOldSetDeleted = Set("Deleted")
Set Deleted Off
Select '_GetPKKey_'
Set Order To Tag (Tag(m.lnTagNo,'_GetPKKey_')) ;
In '_GetPKKey_' Descending
Locate
lnLastValue = Max(m.lnLastValue, Evaluate(Key(m.lnTagNo,'_GetPKKey_')))
lnNewValue = m.lnLastValue + Evl(m.tnCount,1)
If Upper(m.lcOldSetDeleted) == 'ON'
Set Deleted On
Endif
Exit
Endif
Endfor
Use In '_GetPKKey_'
Select ids
Endif
* Increment
Replace ids.NextID With m.lnNewValue In 'ids'
Unlock In 'ids'
Select (lnOldArea)
Set Reprocess To lnOldReprocess
Return ids.NextID
Endfunc
Note: If you use this, as I see from your code, you would need to change the "id table" name to idkeys, field names to keyname, keyvalue:
ids => idKeys
tablename => keyName
nextId => keyValue
Or in your database just create a new table with this code:
CREATE TABLE ids (TableName c(50), NextId i)
INDEX on TableName TAG TableName
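For readers who do not use VFP, the core idea of the function above (take a lock, increment the counter row, release the lock, return the new value) can be sketched in Python with sqlite3. This is only an illustration under assumptions: new_id is a hypothetical name, SQLite's BEGIN IMMEDIATE stands in for VFP's RLOCK(), and the primary key check against the target table is left out.

```python
import sqlite3


def new_id(conn, table_name, count=1):
    """Reserve `count` ids for `table_name` and return the last one reserved."""
    key = table_name.upper()
    # BEGIN IMMEDIATE takes the write lock up front, so two callers cannot
    # read the same counter value and both increment it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Create the counter row on first use (no-op if it already exists).
        conn.execute(
            "INSERT OR IGNORE INTO ids (tablename, nextid) VALUES (?, 0)", (key,)
        )
        conn.execute(
            "UPDATE ids SET nextid = nextid + ? WHERE tablename = ?", (count, key)
        )
        (value,) = conn.execute(
            "SELECT nextid FROM ids WHERE tablename = ?", (key,)
        ).fetchone()
        conn.execute("COMMIT")
        return value
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None leaves transaction control to the explicit
# BEGIN/COMMIT statements above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ids (tablename TEXT PRIMARY KEY, nextid INTEGER)")
first = new_id(conn, "customers")
second = new_id(conn, "customers")
print(first, second)
```

The point of the design is that the read and the write of the counter happen under one lock, which is what makes it safe with several users.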
When executing SELECT newid("TABLENAME")
The error: Invalid subscript reference is thrown
The SQL Select command in VFP requires a From clause.
Running a procedure or a function can, or rather usually must, be done differently:
For example, in the IDE's Command Window you can do a
? newid("xy") && the function must be "in scope",
&& i.e in your case the database that contains the "Stored
&& Procedure" must have been opened in advance
&& or you store the function result in a variable
Local lnNextID
lnNextID = newid("xy")
Or you can use it in an SQL SELECT when you have a From alias
CREATE CURSOR placebo (col1 Int)
INSERT INTO placebo VALUES (8)
Select newid("xy") FROM placebo

Assign value of a column to variable in sql use jinja template language

I have a SQL file like this to transform a table that has a column containing a JSON string:
{{ config(materialized='table') }}
with customer_orders as (
select
time,
data as jsonData,
{% set my_dict = fromjson( jsonData ) %}
{% do log("Printout: " ~ my_dict, info=true) %}
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
When I run dbt run, it returns this:
Running with dbt=0.21.0
Encountered an error:
the JSON object must be str, bytes or bytearray, not Undefined
I cannot even print out the value of the column I want:
{{ config(materialized='table') }}
with customer_orders as (
select
time,
tag,
data as jsonData,
{% do log("Printout: " ~ data, info=true) %}
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
22:42:58 | Concurrency: 1 threads (target='dev')
22:42:58 |
Printout:
22:42:58 | Done.
But if I create another model to printout the values of jsonData:
{%- set payment_methods = dbt_utils.get_column_values(
table=ref('customer_orders_model'),
column='jsonData'
) -%}
{% do log(payment_methods, info=true) %}
{% for json in payment_methods %}
{% set my_dict = fromjson(json) %}
{% do log(my_dict, info=true) %}
{% endfor %}
It prints out the JSON value I want:
Running with dbt=0.21.0
This is log
Found 2 models, 0 tests, 0 snapshots, 0 analyses, 372 macros, 0 operations, 0 seed files, 0 sources, 0 exposures
21:41:15 | Concurrency: 1 threads (target='dev')
21:41:15 |
['{"log": "ok", "path": "/var/log/containers/...log", "time": "2021-10-26T08:50:52.412932061Z", "offset": 527, "stream": "stdout", "#timestamp": 1635238252.412932}']
{'log': 'ok', 'path': '/var/log/containers/...log', 'time': '2021-10-26T08:50:52.412932061Z', 'offset': 527, 'stream': 'stdout', '#timestamp': 1635238252.412932}
21:41:21 | Done.
But I want to process this jsonData within a model file like customer_orders_model above.
How can I get the value of a column, assign it to a variable, and then process it however I want (for example, check whether the JSON has a key I want and set its value in a new column)?
Note: my table has a JSON string column, and I want to extract it into many columns so that I can easily write the SQL queries I need.
In the case of BigQuery, Google provides JSON functions in Standard SQL.
If your column is a JSON string, I think you can use JSON_EXTRACT to get the value of the key you want.
Example:
with customer_orders as (
select
time,
tag,
data as jsonData,
json_extract(data, '$.log') AS log,
from `dc-warehouses.raw_data.logs_trackfoe_prod`
limit 5
)
select *
from customer_orders
You are very close! The thing to remember is that dbt and Jinja are primarily for rendering text. Anything that isn't in curly brackets is just a text string.
So in your first example, data and jsonData are substrings of the larger query (which is also a string). They aren't variables that Jinja knows about, which explains the error message saying that they are Undefined:
with customer_orders as (
select
time,
data as jsonData,
from `warehouses.raw_data.customer_orders`
limit 5
)
select *
from customer_orders
This is why dbt_utils.get_column_values() works for you: that macro actually runs a query to get the data, and you're assigning the result to a variable. The run_query macro can be helpful for situations like this (and I'm fairly certain get_column_values uses run_query in the background).
Regarding your original question about turning a JSON dict into multiple columns, I'd first recommend having your database do this directly. Many databases have functions that let you do this. Jinja is primarily for generating SQL queries dynamically, not for manipulating data. Even if you could load all the JSON into Jinja, I don't know how you'd write that back into a table without using something like an INSERT INTO ... VALUES statement which, IMHO, goes against the design principle of dbt.
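To illustrate "generate the SQL, let the database handle the data", here is a small Python sketch that renders one json_extract column per key. It is only a stand-in for what a Jinja for-loop in the model would do. build_json_columns_query is a hypothetical helper, the key names are taken from the sample row logged in the question, and it assumes every key is a plain identifier that is safe to use as a column alias.

```python
def build_json_columns_query(table, json_column, keys):
    """Render a SELECT that extracts each JSON key into its own column."""
    columns = [
        f"  json_extract({json_column}, '$.{key}') AS {key}"
        for key in keys
    ]
    # Keep the plain `time` column, then one extracted column per key.
    select_list = ",\n".join(["  time"] + columns)
    return f"select\n{select_list}\nfrom `{table}`"


# Table name from the question; keys from the logged sample row.
sql = build_json_columns_query(
    "warehouses.raw_data.customer_orders", "data", ["log", "path", "stream"]
)
print(sql)
```

The template layer only ever sees the list of key names. The JSON values themselves never leave the database.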

Perl DBI - binding a list

How do I bind a variable to a SQL set for an IN query in Perl DBI?
Example:
my @nature = ('TYPE1','TYPE2'); # This is normally populated from elsewhere
my $qh = $dbh->prepare(
"SELECT count(ref_no) FROM fm_fault WHERE nature IN ?"
) || die("Failed to prepare query: $DBI::errstr");
# Using the array here only takes the first entry in this example; using an array ref gives no result
# bind_param and named bind variables give similar results
$qh->execute(@nature) || die("Failed to execute query: $DBI::errstr");
print $qh->fetchrow_array();
The code above returns only the count for TYPE1, while the required output is the sum of the counts for TYPE1 and TYPE2. Replacing the bind entry with a reference to @nature (\@nature) returns 0 results.
The main use-case for this is to allow a user to check multiple options using something like a checkbox group and it is to return all the results. A work-around is to construct a string to insert into the query - it works, however it needs a whole lot of filtering to avoid SQL injection issues and it is ugly...
In my case, the database is Oracle, ideally I want a generic solution that isn't affected by the database.
There should be as many ? placeholders as there are elements in @nature, i.e. in (?,?,..)
my @nature = ('TYPE1','TYPE2');
my $pholders = join ",", ("?") x @nature;
my $qh = $dbh->prepare(
"SELECT count(ref_no) FROM fm_fault WHERE nature IN ($pholders)"
) or die("Failed to prepare query: $DBI::errstr");
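The same placeholder-expansion technique is not specific to Perl or Oracle. As a rough cross-check, here is a self-contained Python sketch using the standard sqlite3 module; the table contents are made up, and only the "?" markers are interpolated into the SQL text while the values are still bound by the driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fm_fault (ref_no INTEGER, nature TEXT)")
conn.executemany(
    "INSERT INTO fm_fault VALUES (?, ?)",
    [(1, "TYPE1"), (2, "TYPE2"), (3, "TYPE2"), (4, "TYPE3")],  # made-up rows
)

nature = ["TYPE1", "TYPE2"]
# One "?" per value; only the placeholders go into the SQL string.
placeholders = ",".join("?" * len(nature))
query = f"SELECT count(ref_no) FROM fm_fault WHERE nature IN ({placeholders})"
(count,) = conn.execute(query, nature).fetchone()
print(count)
```

Note that an empty list would produce IN (), which Oracle and many other databases reject, so that case is worth guarding against before preparing the statement.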

Django select only rows with duplicate field values

Suppose we have a model in Django defined as follows:
class Literal(models.Model):
    name = models.CharField(...)
    ...
The name field is not unique, and thus can have duplicate values. I need to accomplish the following task:
Select all rows from the model that have at least one duplicate value of the name field.
I know how to do it using plain SQL (maybe not the best solution):
select * from literal where name IN (
select name from literal group by name having count((name)) > 1
);
So, is it possible to select this using the Django ORM? Or is there a better SQL solution?
Try:
from django.db.models import Count
(
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)
This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:
dupes = (
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)
Literal.objects.filter(name__in=[item['name'] for item in dupes])
This was rejected as an edit. So here it is as a better answer
dups = (
Literal.objects.values('name')
.annotate(count=Count('id'))
.values('name')
.order_by()
.filter(count__gt=1)
)
This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:
Literal.objects.filter(name__in=dups)
The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.
try using aggregation
Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)
In case you use PostgreSQL, you can do something like this:
from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value
duplicate_ids = (Literal.objects.values('name')
.annotate(ids=ArrayAgg('id'))
.annotate(c=Func('ids', Value(1), function='array_length'))
.filter(c__gt=1)
.annotate(ids=Func('ids', function='unnest'))
.values_list('ids', flat=True))
It results in this rather simple SQL query:
SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1
OK, so for some reason none of the above worked for me; it always returned <MultilingualQuerySet []>. I use the following solution, which is much easier to understand but not so elegant:
dupes = []
uniques = []
dupes_query = MyModel.objects.values_list('field', flat=True)
# Iterate over the raw values; looping over set(dupes_query) would never see a repeat
for dupe in dupes_query:
    if dupe not in uniques:
        uniques.append(dupe)
    else:
        dupes.append(dupe)
print(set(dupes))
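The same in-Python check can be written more compactly with collections.Counter. This is a minimal sketch on a plain list: duplicated_values is a hypothetical helper, and the sample list stands in for list(MyModel.objects.values_list('field', flat=True)). Like the loop above, it pulls every value into memory, so it only suits tables of modest size.

```python
from collections import Counter


def duplicated_values(values):
    """Return the set of values that occur more than once."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n > 1}


# Stand-in for the values_list(...) result; made-up data.
names = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
print(duplicated_values(names))
```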
If you want only a list of names rather than objects, you can use the following query:
repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat=True)