Using DBT to create join queries, but the result omits some columns

I have the following code that uses a right join to connect my data from Table 1 to Table 2. DBT compiled the code without errors, but I'm not getting the columns I need...
{{
    config(
        materialized='incremental'
    )
}}

with incremental_salesorder as (
    select * from {{ source('db_warehouse', 'sale_order_line') }}
),

final as (
    select distinct
        incremental_salesorder.product_code_cust,
        incremental_salesorder.order_id as id,
        incremental_salesorder.create_date,
        incremental_salesorder.name as product_name,
        incremental_salesorder.product_name_cust,
        sale_order.name as sale_order_ref
    from incremental_salesorder
    right join {{ source('db_warehouse', 'sale_order') }} using (id)
    order by incremental_salesorder.create_date
)

{% if is_incremental() %}
where incremental_salesorder.create_date >= (select max(create_date) from {{ this }})
{% endif %}

select * from final
incremental_salesorder.order_id and incremental_salesorder.name are not in the results, even though the code compiled successfully.
What am I doing wrong here?

Rookie mistake:
Ensure that the model name defined in dbt_project.yml matches:
models:
  dbt_test:
    # Applies to all files under models/example/
    example:
      materialized: view
      +schema: staging
      +enabled: false
    sales_order_unique_incremental:  # <- this key must match the folder name
      materialized: table
      +schema: datastudio
I completely missed the warning. Once this was corrected, I was able to compile the query and got the results I needed. In case anyone needs an example of how to do a join, this is a working method :)
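For anyone who wants the join itself spelled out, here is a cleaned-up sketch of the model from the question (same columns and sources; the incremental filter is moved inside the first CTE where it belongs, the join keys are written out explicitly because the two tables name the key differently, and the ORDER BY is dropped since ordering inside a CTE has no effect on the final result):

{{
    config(
        materialized='incremental'
    )
}}

with incremental_salesorder as (

    select * from {{ source('db_warehouse', 'sale_order_line') }}

    {% if is_incremental() %}
    -- only pick up rows newer than what is already in the target table
    where create_date >= (select max(create_date) from {{ this }})
    {% endif %}

),

final as (

    select distinct
        incremental_salesorder.product_code_cust,
        incremental_salesorder.order_id as id,
        incremental_salesorder.create_date,
        incremental_salesorder.name as product_name,
        incremental_salesorder.product_name_cust,
        sale_order.name as sale_order_ref
    from incremental_salesorder
    -- explicit join condition: the line table calls the key order_id,
    -- the header table calls it id, so using (id) cannot match them
    right join {{ source('db_warehouse', 'sale_order') }} as sale_order
        on incremental_salesorder.order_id = sale_order.id

)

select * from final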

Related

Replicate a case when statement in Jinja

I want to replicate a simple CASE WHEN statement with a Jinja block in dbt.
How can I achieve this?
Here is my statement:
CASE status
    WHEN 0 THEN 'pending'
    WHEN 1 THEN 'ordered'
    WHEN 2 THEN 'shipped'
    WHEN 3 THEN 'received'
    WHEN 4 THEN 'delivered'
    ELSE NULL
END as status_mapping
You have a couple of options. First, you can define the mappings in an array or dict (if the ids are not a sequence) and loop through it to produce the full CASE statement:
{% set statuses = ['pending', 'ordered', 'shipped', 'received', 'delivered'] %}
CASE STATUS
{% for status in statuses %}
    WHEN {{ loop.index - 1 }} THEN '{{ status }}'
{% endfor %}
ELSE NULL END STATUS_MAPPING
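If the ids are not a sequence, the dict flavour mentioned above looks much the same; a sketch with the same statuses keyed by their codes (Jinja dicts iterate via .items()):

{% set statuses = {0: 'pending', 1: 'ordered', 2: 'shipped', 3: 'received', 4: 'delivered'} %}
CASE STATUS
{% for code, status in statuses.items() %}
    WHEN {{ code }} THEN '{{ status }}'
{% endfor %}
ELSE NULL END STATUS_MAPPING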
The other option is to put the mappings into a CSV, load it as a seed data file in DBT (https://docs.getdbt.com/docs/build/seeds), then join with the seed data as a ref.
Create a file called status_mappings.csv:
status_code,status
0,pending
1,ordered
2,shipped
3,received
4,delivered
Run dbt seed, then add:
WITH STATUS_MAPPINGS AS (
    SELECT * FROM {{ ref('status_mappings') }}
)
SELECT SM.STATUS
FROM MY_TABLE T1
JOIN STATUS_MAPPINGS SM ON T1.STATUS_CODE = SM.STATUS_CODE
You can also use a macro to insert reusable SQL snippets across different queries; reusability is one possible reason you might want to do this.
You could define the macro as follows:
-- yourproject/macros/status_mapping.sql
{% macro status_mapping(status) %}
CASE {{ status }}
    WHEN 0 THEN 'pending'
    WHEN 1 THEN 'ordered'
    WHEN 2 THEN 'shipped'
    WHEN 3 THEN 'received'
    WHEN 4 THEN 'delivered'
    ELSE NULL
END
{% endmacro %}
(I have kept the definition flexible)
... and call it in a model e.g. as follows:
-- yourproject/models/base/base__orders.sql
SELECT
    order_id,
    status_code,
    {{ status_mapping('status_code') }} AS status
FROM
    {{ source('your_dataset', 'orders') }}
Note the use of quotes around the field name, same as with the built-in source macro two lines below. By including the field name as a macro argument instead of hard-coding it (and keeping the aliasing AS status outside the macro) you allow yourself flexibility to change things in future.
When you run DBT, this would then compile to something like:
SELECT
    order_id,
    status_code,
    CASE status_code
        WHEN 0 THEN 'pending'
        WHEN 1 THEN 'ordered'
        WHEN 2 THEN 'shipped'
        WHEN 3 THEN 'received'
        WHEN 4 THEN 'delivered'
        ELSE NULL
    END AS status
FROM
    your_dataset.orders

SQL join query with empty fields or fields containing fixed text

I'm trying to build a select query with fields from two tables, plus other fields that do not exist in my DB, e.g. empty fields or fields with a particular text like 'not available'.
For example, my query looks like this:
select
    Product.id,
    Product.component_number,
    Product_discount.price
from Product
left join Product_discount on Product_discount.id = Product.id
where
    Product.deleted = 0
Now, in the same query, I want to add fields that do not exist, like:
Codenumber >> should always be empty
Factor >> should always contain the text 'Yes' for every row
The idea is to build a query whose output is identical to an import file.
Thanks...
You can try this:
select
    Product.id,
    Product.component_number,
    Product_discount.price,
    '' as Codenumber,
    'Yes' as Factor
from Product
left join Product_discount on Product_discount.id = Product.id
where Product.deleted = 0;
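One caveat, in case the import target distinguishes the two: '' is an empty string, not NULL. If the file format expects a true NULL in that column, select it explicitly instead, e.g.:
NULL as Codenumber,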
Thank you

Airflow: how can I automate a query to run for every date specified rather than hard-coding it?

I am new to Airflow, so apologies if this has been asked somewhere.
I have a query I run in Hive that is partitioned on year-month, e.g. 202001.
How can I run a query that takes a variable and runs for different values of it in Airflow? E.g., taking this example:
from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_arg = {'owner': 'airflow', 'start_date': '2020-02-28'}

dag = DAG('simple-mysql-dag',
          default_args=default_arg,
          schedule_interval='00 11 2 * *')

mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           sql='<path>/sample_sql.sql',
                           params={'test_user_id': -99})
where my sample_sql.hql looks like:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
cpd_ym = ${ym}
) PURGE;
INSERT INTO sample_df
PARTITION (
cpd_ym = ${ym}
)
SELECT
*
from sourcedf
;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS FOR COLUMNS;
I want to run the above for different values of ym using Airflow, e.g. between 202001 and 202110. How can I do this?
I'm a bit confused because you are asking about Hive yet you show an example with MySqlOperator. In any case, assuming the sql/hql parameter is templated, you can use execution_date directly in your query and extract the year and month from it for the partition value.
Example:
mysql_task = MySqlOperator(
    dag=dag,
    task_id='mysql_task',
    sql="""SELECT {{ execution_date.strftime('%Y%m') }}""",
)
(Note '%Y%m' with a capital Y: the partition values are of the form 202001, i.e. a four-digit year.)
So in your sample_sql.hql it will be:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
    cpd_ym = {{ execution_date.strftime('%Y%m') }}
) PURGE;
You mentioned that you are new to Airflow, so make sure you understand what execution_date is and how it's calculated (if you don't, check this answer). You can apply string manipulation to the other macros as well. Choose the macro that suits your needs (execution_date / prev_execution_date / next_execution_date / etc.).
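To cover a whole historical range such as 202001 through 202110, the usual approach (assuming the monthly schedule_interval from the DAG above) is to let the scheduler create one run per month and backfill the range from the CLI, e.g. airflow dags backfill -s 2020-01-01 -e 2021-10-31 simple-mysql-dag (airflow backfill on the 1.x CLI). Each run then renders the same templated script with its own execution_date; a sketch of sample_sql.hql in that style:

ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
    cpd_ym = {{ execution_date.strftime('%Y%m') }}
) PURGE;

INSERT INTO sample_df
PARTITION (
    cpd_ym = {{ execution_date.strftime('%Y%m') }}
)
SELECT *
FROM sourcedf;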

WITH-constrained consecutive updates

Please assume I have built a query in MS SQL Server with the following structure:
WITH issues_a AS
(
    SELECT a_prop
    FROM ds_X x
),
issues_b AS
(
    SELECT key
         , z.is_flagged AS is_flagged
         , some_prop
    FROM ds_Z z
    JOIN issues_a i_a
        ON z.a_diff = i_a.a_prop
)
-- {{ run }}
UPDATE s
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
        WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
    END
FROM samples s
LEFT JOIN issues_b i_b ON s.key = i_b.key;
Now I want to enhance the whole thing, updating one more table consecutively, by enclosing parts of the query in BEGIN TRANSACTION and COMMIT, but I can't get my head around how to do it. I tried enclosing the whole expression in the transaction brackets, but that didn't get me any further.
Are there other ways to achieve the above task, even without chaining the consecutive updates in a transactional manner, though that would be better?
To abbreviate the task again: WITH <...>(...), <...>(...) UPDATE <... using data from the latter WITH>; UPDATE <... using data from the latter WITH>?
Hope you don't mind my poor grammar...
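A sketch of one way to approach this, hedged accordingly: in T-SQL a WITH clause binds to exactly one following statement, so it cannot feed two UPDATEs. Either repeat the CTE list in front of each UPDATE, or materialize it once into a temp table and reuse it, wrapping both updates in a transaction. In the sketch below, other_samples is a hypothetical second target table:

BEGIN TRANSACTION;

-- Materialize the CTE result once so both updates can reuse it
-- ([key] is bracketed because KEY is a reserved word in T-SQL).
SELECT z.[key]
     , z.is_flagged
     , z.some_prop
INTO #issues_b
FROM ds_Z z
JOIN ds_X x
    ON z.a_diff = x.a_prop;

-- First update, as in the original query.
UPDATE s
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
        WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
    END
FROM samples s
LEFT JOIN #issues_b i_b ON s.[key] = i_b.[key];

-- Second, consecutive update driven by the same staged data
-- (other_samples is a hypothetical second target table).
UPDATE o
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
    END
FROM other_samples o
LEFT JOIN #issues_b i_b ON o.[key] = i_b.[key];

COMMIT;

DROP TABLE #issues_b;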

PostgreSQL function multiple rows returned from subquery in case statement

So I have written a PostgreSQL function that is supposed to search a table based on a huge number of optional input parameters, which I group with lots of AND statements. This one, however:
AND
(
    (newcheck IS NULL)
    OR
    (
        newcheck IS NOT NULL AND product.id IN (
            CASE WHEN newcheck = 'New'
            THEN
                (SELECT product.id FROM product WHERE product.anew IS true)
            ELSE
                (SELECT product.id FROM product WHERE product.anew IS false)
            END
        )
    )
)
gives me an
ERROR: more than one row returned by a subquery used as an expression
This isn't helping much, since I do want it to return a lot more than one row.
The value of the newcheck variable will be sent from a dropdown menu in a web form, so it can only be 'New' or 'Old'.
Any ideas on what might be causing this problem?
Try something like:
AND ((newcheck IS NULL)
     OR (newcheck IS NOT NULL
         AND product.id IN (SELECT product.id
                            FROM product
                            WHERE product.anew = CASE WHEN newcheck = 'New'
                                                      THEN true
                                                      ELSE false
                                                 END)))
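Since anew is a boolean (the question tests it with IS true / IS false), the CASE can collapse into a direct comparison, and the IS NOT NULL guard is redundant because the IS NULL branch already covers that case. A more compact equivalent of the same predicate:

AND (newcheck IS NULL
     OR product.id IN (SELECT product.id
                       FROM product
                       WHERE product.anew = (newcheck = 'New')))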