How do you write incremental models in dbt?

I have two large tables I'm trying to join & filter using dbt.
The SQL is very simple, roughly:
SELECT
    u.user_id, t.transaction_id
FROM users u
JOIN transactions t ON t.user_id = u.user_id
WHERE u.active = 1
Currently I'm using the "table" materialization, but this is fairly wasteful, as the underlying tables are 99.99% the same from run to run.
However, I can't work out from the dbt documentation how to make this model "incremental".
Any ideas?
PS. I'm running on SQL Server.

As @anders-swanson wrote in his comment, if transaction_id is definitely unique, you could set it as the unique_key and materialize your model as an incremental table.
dbt's docs explain how to do this. Using your example, it might be:
{{
    config(
        materialized='incremental',
        unique_key='transaction_id'
    )
}}

select
    u.user_id,
    t.transaction_id
from users u
join transactions t on t.user_id = u.user_id
where u.active = 1
If transaction_id is not unique but the combination of user_id and transaction_id is, you could create a new column that concatenates the two (CONCAT rather than ||, since you're on SQL Server) and assign that as the unique_key:
{{
    config(
        materialized='incremental',
        unique_key='pkey'
    )
}}

select
    u.user_id,
    t.transaction_id,
    concat(u.user_id, '-', t.transaction_id) as pkey
from users u
join transactions t on t.user_id = u.user_id
where u.active = 1
Otherwise, you'll have to pull in a column that is either a) unique, or b) has an ordered quality that can be used to apply an is_incremental() filter (as @viacheslav-nefedov wrote).
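For case b), here's a minimal sketch of such a filter, assuming the transactions table also has a transaction_date column (not shown in the question):

{{
    config(
        materialized='incremental',
        unique_key='transaction_id'
    )
}}

select
    u.user_id,
    t.transaction_id,
    t.transaction_date
from users u
join transactions t on t.user_id = u.user_id
where u.active = 1
{% if is_incremental() %}
  -- only scan rows newer than what is already in the target table
  and t.transaction_date > (select max(transaction_date) from {{ this }})
{% endif %}

On the first run the filter is skipped and the full history is loaded; on every later run dbt only scans the new rows and merges them on transaction_id.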

If you have a date field, you can use it to load only the latest data. Say you have a transaction_date column.
{{ config(
    materialized='incremental',
    as_columnstore=false,
    pre_hook="""
        {% if is_incremental() %}
        delete from {{ this }}
        where transaction_date >= '{{ (modules.datetime.datetime.now() - modules.datetime.timedelta(2)).isoformat() }}'
        {% endif %}
    """
) }}
SELECT
    u.user_id,
    t.transaction_id,
    -- transaction_date has to be part of the model, or the
    -- pre_hook delete on the target table will fail
    t.transaction_date
FROM users u
JOIN transactions t ON t.user_id = u.user_id
WHERE u.active = 1
{% if is_incremental() %}
  AND t.transaction_date >= '{{ (modules.datetime.datetime.now() - modules.datetime.timedelta(2)).isoformat() }}'
{% endif %}
The first time you run this model, all the code inside the is_incremental() blocks is ignored. On every subsequent run, it deletes the transactions from the last two days and reloads them.

Related

PostgreSQL - Optimize subquery by referencing outer query

I have two tables: users and orders. Orders is a massive table (>100k entries) and users is relatively small (around 400 entries).
I want to find the number of orders per user. The column linking both tables is the email column.
I can achieve this with the following query:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o
WHERE o.status = 'COMPLETED'
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
However, as mentioned previously, the order table is very large, so I would ideally want to add another condition to the WHERE clause in the Subquery to only retrieve those emails that exist in the User table.
I can simply add the user table to the subquery like this:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o, users AS u
WHERE o.status = 'COMPLETED'
and o.cust_email = u.email
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
This does speed up the query, but sometimes the outer query is much more complex than just selecting all entries from the users table, so this solution does not always work. The goal would be to somehow link the outer and the inner query. I've thought of join queries but cannot figure out how to get them to work.
I noticed that the first query seems to perform faster than I expected, so perhaps PostgreSQL is already smart enough to connect the outer and inner tables. However, I was hoping that someone could shed some light on how this works and what the best way to perform these types of subqueries is.
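One way to make that link explicit in PostgreSQL is a LATERAL subquery, which is allowed to reference columns of the outer query. A sketch using the names from the question (assuming createdate is a column of users):

SELECT sub_1.num, u.id
FROM users AS u
CROSS JOIN LATERAL (
    -- this subquery can see u.email, so only the current user's
    -- orders are scanned and aggregated
    SELECT COUNT(o.purchaseid) AS num
    FROM orders AS o
    WHERE o.status = 'COMPLETED'
      AND o.cust_email = u.email
) AS sub_1
ORDER BY u.createdate DESC NULLS LAST;

With an index on orders (cust_email), or a partial index filtered on status = 'COMPLETED', each lateral lookup becomes an index scan instead of a pass over the whole orders table.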

postgres select where value must exist for list of array

I am stuck with a query problem in Postgres. I need to select all the user ids that have a score for every rule in a certain set of rules, where the rule ids come from a subquery.
Here is my table structure:
users: {user_id, name, email}
rules: {rule_id, name, description, query}
user_scores: {user_id, rule_id, points}
So in the above example I need to find all the users who have played all the rules.
I wanted to avoid looping in a backend language to check whether a user has scores for every rule. I need to fetch all user ids from the user_scores table where all the rule ids exist. I am looking for an IN operation, but with AND semantics, like
select user_id
from user_scores
WHERE rule_id IN (select rule_id from rules)
group by user_id
So instead of 'IN' it should work like 'IN for ALL', or something like that.
Any help will be greatly appreciated.
To get a list of all users that have a score for all rules, you can cross join users with rules to build every user/rule pair that has to exist, then use an outer join with user_scores to keep only the users with no missing pair:
SELECT u.user_id, u.name
FROM users AS u
CROSS JOIN rules AS r
LEFT JOIN user_scores AS us
    ON us.user_id = u.user_id
    AND us.rule_id = r.rule_id
GROUP BY u.user_id, u.name
HAVING count(*) FILTER (WHERE us.rule_id IS NULL) = 0;
I couldn't test it, so there may be some slight errors.
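An alternative sketch that avoids joining users at all: count the distinct rules each user has scored and compare with the total number of rules (this assumes rule_id is unique in rules):

SELECT us.user_id
FROM user_scores AS us
GROUP BY us.user_id
HAVING count(DISTINCT us.rule_id) = (SELECT count(*) FROM rules);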

Improve performance postgresql query

I have 3 tables: users, posts and likes. A post is called a hot post if it has more than 5 likes within the first hour after its creation. The following query is used to get the list of hot posts. Can anyone help me improve this query (how to index or rewrite it)?
SELECT post.id,
       post.content,
       u.username,
       COUNT(likes.id)
FROM posts AS post
LEFT OUTER JOIN users AS u
    ON post.user_id = u.id
INNER JOIN likes
    ON post.id = likes.post_id
    AND likes.created_at - INTERVAL '1 hour' < post.created_at
GROUP BY post.id, u.username
HAVING COUNT(likes.id) >= 5
ORDER BY post.created_at DESC;
First, unless there really can be a post that does not belong to a user, use an inner join there.
Assuming that there is a good number of posts and likes, the best join strategy would be a merge join or a hash join, which PostgreSQL should choose automatically.
For a merge join, the following indexes might be helpful:
CREATE INDEX ON posts (id);
CREATE INDEX ON likes (post_id);
No index could help with a hash join in this case.
If the planner chooses a nested loop join after all, it might be useful to rewrite the query to:
... AND likes.created_at < post.created_at + INTERVAL '1 hour'
and create an index like
CREATE INDEX ON likes (post_id, created_at);
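Putting those suggestions together, the full query might look like this (a sketch that keeps the asker's table and column names):

SELECT post.id,
       post.content,
       u.username,
       COUNT(likes.id)
FROM posts AS post
JOIN users AS u
    ON post.user_id = u.id
JOIN likes
    ON post.id = likes.post_id
    -- the range condition is on likes.created_at alone, so the
    -- (post_id, created_at) index can serve it
    AND likes.created_at < post.created_at + INTERVAL '1 hour'
GROUP BY post.id, u.username
HAVING COUNT(likes.id) >= 5
ORDER BY post.created_at DESC;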

SQL query - get rows from a table based on one condition and through join table based on another condition

I have a tags table - id, name, owner_id (owner_id is FK for users)
and a user_tags table - user_id, tag_id (linking table between users a tags for the purpose of sharing those tags - ie, users who can access the tag but aren't the owner)
I have a query that can get me tags through a join on the user_tags table:
SELECT tags.*
FROM tags
JOIN user_tags ON user_tags.user_id = 2
    AND user_tags.tag_id = tags.id
LIMIT 0, 30
But in that same query I'd also like to select tags WHERE tags.owner_id = 2, getting all tags shared with that user through the linking table(user_tags) and also tags that user owns (tags.owner_id = user_id).
If I include WHERE tags.owner_id = 2 after the join, It only returns results where tags.owner_id = 2.
If I include OR tags.owner_id = 2, I get repeats of all the results.
If I make the statement SELECT DISTINCT... OR tags.owner_id = 2 I end up with the correct result set, but I'm not sure that's the correct way to do this join with condition.
Is there a better way/best practice?
Also, why does a join return multiples of the results (i.e. why is SELECT DISTINCT or GROUP BY necessary)?
Thank you.
I wouldn't use user_tags.user_id as part of the join condition; specify both conditions in the WHERE clause to make your intent clearer. But to answer your question: yes, you would need to de-dupe tags with DISTINCT if one tags.id can be associated with many user_tags rows. Note that owner_id lives on tags, not user_tags, and the join needs to be a LEFT JOIN so that tags the user owns but hasn't shared with anyone are still returned:
SELECT DISTINCT tags.*
FROM tags
LEFT JOIN user_tags ON tags.id = user_tags.tag_id
WHERE user_tags.user_id = 2
   OR tags.owner_id = 2
LIMIT 0, 30
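An alternative sketch that avoids DISTINCT: query the two cases separately and let UNION de-duplicate them (UNION, unlike UNION ALL, removes duplicate rows, so a tag that is both owned and shared appears once):

SELECT tags.*
FROM tags
JOIN user_tags ON user_tags.tag_id = tags.id
WHERE user_tags.user_id = 2
UNION
SELECT tags.*
FROM tags
WHERE tags.owner_id = 2
LIMIT 0, 30;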

'Autosuggestion' feature implementation for a webapp

I'm developing a web application and have two models (among others), users and items, with a many-to-many association. So I have tables 'users', 'items' and 'items_users' with primary key 'id' and foreign keys user_id and item_id.
What I'm going to build is an 'autosuggestion' feature. If, say, I as a user mark a certain item as good, the system is supposed to suggest n items I would most probably also mark as good. A reasonable criterion for the suggestion is how many users who liked the first item also like another one. If all users who like tea also like a teapot, then the teapot is in the top position for suggestion.
This is basic functionality, I'll also filter some results but the rest doesn't matter. I'm thinking about some kind of an auxiliary table for fast calculation on demand or scheduling a separate process to calculate n suggestions.
Thank you for any related information!
UPD
The question sounded unclear. I have an SQL db and Sinatra with the Sequel ORM. I'm asking how to calculate the most-similar-items dataset (the cheapest, least resource-consuming approach). How would you implement it?
So, generally you want to select all users that liked the same products, then get the products they like by counting the number of likes for each product, and output the most liked products.
So how would this look in SQL?
Step 1: Get the id's of your favourites
SELECT it.item_id FROM `item_users` it WHERE it.user_id = %current_user%
Step 2: Get the users who like the same items
SELECT u.id FROM `item_users` it, `users` u WHERE it.item_id IN (
SELECT it.item_id FROM `item_users` it WHERE it.user_id = %current_user%
) AND it.user_id != %current_user% AND u.id = it.user_id GROUP BY it.user_id
Step 3: Get their favourites
And the entire SQL query would look like this:
SELECT i.* FROM `items` i, `item_users` it WHERE it.user_id IN (
SELECT u.id FROM `item_users` it, `users` u WHERE it.item_id IN (
SELECT it.item_id FROM `item_users` it WHERE it.user_id = %current_user%
) AND it.user_id != %current_user% AND u.id = it.user_id GROUP BY it.user_id
) AND i.id = it.item_id GROUP BY i.id ORDER BY count(*) DESC
All that's left is to add a LIMIT to the results...
UPDATE:
I guess you would like to get the most popular products first. I've changed the query to add that functionality (added ORDER BY count(*) DESC to the end).
This is a complex query, and using ActiveRecord to implement it would be quite slow and even more complicated, so I would recommend using the query as is.
Use your link table to join users and items.
Apply the following filters in your WHERE clause:
- users that liked the item ("marked it as good")
- items that the current user has not already marked as good
Sort descending by the number of likes (you'll need to group by the item id and count the users).
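A sketch of that approach in SQL, using the items_users link table from the question; :current_user is a placeholder for the current user's id:

SELECT iu.item_id, COUNT(*) AS score
FROM items_users AS iu
WHERE iu.user_id IN (
    -- users who share at least one liked item with the current user
    SELECT iu2.user_id
    FROM items_users AS iu2
    WHERE iu2.item_id IN (SELECT item_id FROM items_users
                          WHERE user_id = :current_user)
      AND iu2.user_id <> :current_user
)
AND iu.item_id NOT IN (
    -- filter out items the current user already marked as good
    SELECT item_id FROM items_users WHERE user_id = :current_user
)
GROUP BY iu.item_id
ORDER BY score DESC
LIMIT 10;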