How can I join fact tables on incremental models without losing data?

I have two tables containing different events, Table A and Table B; both are partitioned tables. I need to join these two tables, but if I filter on a timestamp, I lose some events because of the partition filter.
Example - Table A:
event_a_id | timestamp  | event_b_id
a1         | 2023-10-01 | b1
a2         | 2023-10-01 | b2
Table B:
event_b_id | timestamp  | text
b1         | 2023-01-01 | lorem
b2         | 2023-10-01 | ipsum
Result:
event_a_id | timestamp  | event_b_id | text
a1         | 2023-10-01 | b1         | null
a2         | 2023-10-01 | b2         | ipsum
If I filter both tables on timestamp = "2023-10-01", I will get event b2 but not event b1. How can I avoid this? I can't simply select the whole table, because it is huge, but I also can't publish the table with missing data.
I have tried filtering only one of the tables. That reduces the amount of data processed, but it does not solve the problem of missing information in the rows.

There is a good write-up of your options by Tristan Handy, CEO of dbt Labs, here.
To summarize, you can design for correctness or performance, but you may need to accept some tradeoffs.
Max performance would be filtering both tables on the current date, as you describe:
select a.event_a_id, a.timestamp, a.event_b_id, b.text
from a
left join b
    on a.event_b_id = b.event_b_id
    {% if is_incremental() %}
    and b.timestamp >= (select max(timestamp) from {{ this }})
    {% endif %}
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}
Max correctness would be only filtering table A:
select a.event_a_id, a.timestamp, a.event_b_id, b.text
from a
left join b
    on a.event_b_id = b.event_b_id
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}
A compromise solution might include creating a window for late-arriving data in table A. For example, join rows in B if they were recorded less than 30 days before A:
select a.event_a_id, a.timestamp, a.event_b_id, b.text
from a
left join b
    on a.event_b_id = b.event_b_id
    {% if is_incremental() %}
    -- date arithmetic syntax varies by warehouse; adjust date_sub to your dialect
    and b.timestamp >= (
        select date_sub(max(timestamp), interval '30 days')
        from {{ this }}
    )
    {% endif %}
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}
(If your data in B arrives late, you would flip this logic around; you could also include a range on both tables A and B.)
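To see how the three variants behave on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration under assumptions: `cutoff` stands in for `(select max(timestamp) from {{ this }})`, the names `join_with_lookback` and `lookback_days` are hypothetical, and SQLite's `date()` modifier replaces the warehouse-specific `date_sub`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (event_a_id TEXT, timestamp TEXT, event_b_id TEXT);
    CREATE TABLE b (event_b_id TEXT, timestamp TEXT, text TEXT);
    INSERT INTO a VALUES ('a1', '2023-10-01', 'b1'), ('a2', '2023-10-01', 'b2');
    INSERT INTO b VALUES ('b1', '2023-01-01', 'lorem'), ('b2', '2023-10-01', 'ipsum');
""")


def join_with_lookback(cutoff, lookback_days=None):
    """Filter A on cutoff; filter B on cutoff minus lookback_days.

    lookback_days=None means B is not filtered at all (max correctness);
    lookback_days=0 filters B on the same cutoff as A (max performance).
    """
    if lookback_days is None:
        b_filter, params = "", (cutoff,)
    else:
        b_filter = "AND b.timestamp >= date(?, ?)"
        params = (cutoff, f"-{lookback_days} days", cutoff)
    sql = f"""
        SELECT a.event_a_id, b.text
        FROM a
        LEFT JOIN b
          ON a.event_b_id = b.event_b_id {b_filter}
        WHERE a.timestamp >= ?
        ORDER BY a.event_a_id
    """
    return conn.execute(sql, params).fetchall()


print(join_with_lookback("2023-10-01", 0))     # both tables filtered
print(join_with_lookback("2023-10-01", 30))    # 30-day window on B
print(join_with_lookback("2023-10-01", 365))   # one-year window on B
print(join_with_lookback("2023-10-01"))        # only A filtered
```

With this data a 30-day window is still too short to pick up b1, which was recorded nine months before a1, so the window has to reflect how old the referenced rows in B can realistically be.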

Related

SQL: How to remove duplicate rows created by CASE WHEN statement

I have two tables: (A) customers of the gym and (B) customers of the restaurant. I want to create an indicator in table (A) to indicate the customers who have been to both the gym and the restaurant on the same day. In accomplishing this, I used the following SQL script, but it created duplicate rows:
SELECT *,
CASE WHEN a.GymDate = b.RestaurantDate THEN 'Meal + Gym on the same day'
ELSE 'Gym Only' END AS 'Meal+Gym'
FROM Table_A a
LEFT JOIN Table_B b
ON a.customerid = b.customerid;
May I know how to keep only Table_A, but with the addition of the 'Meal+Gym' Indicator? Thanks!

A case expression does not generate rows; it is your join that is generating the duplicate rows. You could add the date predicate to the join condition and merely check for the existence of a record, e.g.
SELECT *,
CASE WHEN b.customerid IS NOT NULL THEN 'Meal + Gym on the same day'
ELSE 'Gym Only'
END AS [Meal+Gym]
FROM Table_A a
LEFT JOIN Table_B b
ON a.customerid = b.customerid
AND a.GymDate = b.RestaurantDate;
If Table_B is not unique per customer/date, then you may need to do something like this to prevent duplicates:
SELECT *,
CASE WHEN r.RestaurantVisit IS NOT NULL THEN 'Meal + Gym on the same day'
ELSE 'Gym Only'
END AS [Meal+Gym]
FROM Table_A a
OUTER APPLY
( SELECT TOP 1 1
FROM Table_B b
WHERE a.customerid = b.customerid
AND a.GymDate = b.RestaurantDate
) AS r (RestaurantVisit);
N.B. While using single quotes works for column aliases, it is not a good habit, because it makes your column aliases indistinguishable from string literals except by context. Even if this is clear to you, it probably isn't to other people, and since there's about a 10:1 ratio of reading to writing code, writing code that is easy to read is important. For that reason I've used square brackets for your column name instead.

I would start with a table of customers, so you get an indicator for customers who have been to neither the gym nor a restaurant.
Then:
select c.*,
(case when exists (select 1
from table_a a join
table_b b
on a.customerid = b.customerid and
a.GymDate = b.RestaurantDate
where a.customerid = c.customerid
)
then 1 else 0
end) as same_day_gym_restaurant_flag
from customers c;

You can use CASE WHEN EXISTS instead of the LEFT JOIN:
SELECT *,
CASE WHEN EXISTS (
SELECT 1 FROM Table_B b
WHERE a.customerid = b.customerid
AND a.GymDate = b.RestaurantDate)
THEN 'Meal + Gym on the same day'
ELSE 'Gym Only'
END AS 'Meal+Gym'
FROM Table_A a
This assumes that you don't need any data from Table_B in the results.
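The difference between the join-based and EXISTS-based approaches can be checked on a tiny data set. The sketch below uses Python's built-in sqlite3 module with made-up rows (customer 1 has two restaurant visits on the gym day and one on another day); the data and variable names are assumptions for illustration only, and the alias uses standard double quotes because SQLite is the engine here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_A (customerid INTEGER, GymDate TEXT);
    CREATE TABLE Table_B (customerid INTEGER, RestaurantDate TEXT);
    INSERT INTO Table_A VALUES (1, '2024-01-01'), (2, '2024-01-01');
    -- customer 1 ate twice on 2024-01-01 and once on 2024-01-02
    INSERT INTO Table_B VALUES
        (1, '2024-01-01'), (1, '2024-01-01'), (1, '2024-01-02');
""")

# Original query shape: joining on customerid alone multiplies rows.
join_on_id = conn.execute("""
    SELECT a.customerid
    FROM Table_A a
    LEFT JOIN Table_B b ON a.customerid = b.customerid
""").fetchall()

# Date moved into the join: fewer rows, but still duplicated
# when Table_B is not unique per customer/date.
join_on_id_and_date = conn.execute("""
    SELECT a.customerid
    FROM Table_A a
    LEFT JOIN Table_B b
      ON a.customerid = b.customerid
     AND a.GymDate = b.RestaurantDate
""").fetchall()

# EXISTS never multiplies rows: exactly one row per Table_A row.
exists_rows = conn.execute("""
    SELECT a.customerid, a.GymDate,
           CASE WHEN EXISTS (
                    SELECT 1 FROM Table_B b
                    WHERE a.customerid = b.customerid
                      AND a.GymDate = b.RestaurantDate)
                THEN 'Meal + Gym on the same day'
                ELSE 'Gym Only'
           END AS "Meal+Gym"
    FROM Table_A a
    ORDER BY a.customerid
""").fetchall()

print(len(join_on_id), len(join_on_id_and_date), len(exists_rows))
for row in exists_rows:
    print(row)
```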

SQL - Find productID that doesn't have shadows containing a substring as productID

I have the following code:
SELECT DISTINCT TOP 100
Product.ID
FROM db.Product
WHERE 1=1
AND ShadowOf = '' --This means that the product isn't a shadow itself, but a shadow parent, containing shadows.
AND Product.ID NOT IN (select ShadowOf from Product WHERE ShadowOf <> '') --This makes sure that the product doesn't have any shadows.
Order By Product.ID ASC
A shadow is basically like a copy of the product that contains the same attributes and values, images, etc.
The table goes like this:
ID |ShadowOf|Shadows
A | |B
B | A |
B.TE| C |
C | |B.TE
The same product can have multiple shadows;
The same product can be a ShadowOf only one other product;
A product that is already a shadow of another product can't have shadows of its own.
What I need to do:
Find all Product IDs that are not shadows (meaning their ShadowOf value is empty) and don't have shadows that end in ".TE" (they can have any other value in the Shadows column).
What I tried to do:
AND Product.ID IN (SELECT ShadowOf FROM Product WHERE ShadowOf NOT LIKE '%.TE')
What am I expecting to get based on the sample table above:
ID
A
Because A is the only product that is NOT a shadow, and doesn't have a shadow ending in .TE
Edited some mistakes.

Since a product can have multiple shadows, a NOT EXISTS clause is probably the most practical solution:
WHERE 1=1
AND Product.Status=1
AND ShadowOf = ''
AND NOT EXISTS (SELECT * FROM Product p2 WHERE p2.ID = Product.ID AND p2.Shadows LIKE '%.TE')
Demo on SQLFiddle

You can alter your query a little to get the result:
Select DISTINCT a.ID
From Product a
Where (a.ShadowOf = '' or a.ShadowOf is null)
AND a.ID NOT IN (SELECT distinct b.ID FROM Product b WHERE b.Shadows LIKE '%.TE')
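As a quick check against the sample table from the question, here is a small sketch using Python's built-in sqlite3 module. The first query mirrors the NOT EXISTS answer, which reads the Shadows column. The second is an assumed alternative that is not taken from either answer: it follows the ShadowOf link instead, which may be more robust if a parent with several shadows cannot list them all in one Shadows value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ID TEXT, ShadowOf TEXT, Shadows TEXT);
    INSERT INTO Product VALUES
        ('A',    '',  'B'),
        ('B',    'A', ''),
        ('B.TE', 'C', ''),
        ('C',    '',  'B.TE');
""")

# Variant 1: rely on the Shadows column of the parent row itself.
via_shadows_column = [row[0] for row in conn.execute("""
    SELECT p.ID
    FROM Product p
    WHERE p.ShadowOf = ''
      AND NOT EXISTS (SELECT 1 FROM Product p2
                      WHERE p2.ID = p.ID AND p2.Shadows LIKE '%.TE')
    ORDER BY p.ID
""")]

# Variant 2 (assumption): look for child rows whose ShadowOf points
# at the parent and whose own ID ends in .TE.
via_shadowof_link = [row[0] for row in conn.execute("""
    SELECT p.ID
    FROM Product p
    WHERE p.ShadowOf = ''
      AND NOT EXISTS (SELECT 1 FROM Product s
                      WHERE s.ShadowOf = p.ID AND s.ID LIKE '%.TE')
    ORDER BY p.ID
""")]

print(via_shadows_column, via_shadowof_link)
```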

pgSQL FULL OUTER JOIN 'WHERE' Condition

I am trying to create a single query to retrieve the current price and the special sale price if a sale is running. When there isn't a sale on, I want store_picture_monthly_price_special.price AS special_price to return null.
Before adding the 2nd WHERE condition the query executes as I expect it to: store_picture_monthly_price_special.price returns null since there is no sale running at present.
store_picture_monthly_reference | tenure | name | regular_price | special_price
3 | 12 | 12 Months | 299.99 | {Null}
2 | 3 | 3 Months | 79.99 | {Null}
1 | 1 | 1 Month | 29.99 | {Null}
pgSQL is treating the 2nd WHERE condition as "all or none": if there is no sale running, there are no results.
Is it possible to tweak this query so I get regular pricing every time, and the special sale price either as a dollar value when a special is running or as null otherwise? Does what I am trying to accomplish require a sub-query?
This is the query how I presently have it:
SELECT store_picture_monthly.reference AS store_picture_monthly_reference , store_picture_monthly.tenure , store_picture_monthly.name , store_picture_monthly_price_regular.price AS regular_price , store_picture_monthly_price_special.price AS special_price
FROM ( store_picture_monthly INNER JOIN store_picture_monthly_price_regular ON store_picture_monthly_price_regular.store_picture_monthly_reference = store_picture_monthly.reference )
FULL OUTER JOIN store_picture_monthly_price_special ON store_picture_monthly.reference = store_picture_monthly_price_special.store_picture_monthly_reference
WHERE
( store_picture_monthly_price_regular.effective_date < NOW() )
AND
( NOW() BETWEEN store_picture_monthly_price_special.begin_date AND store_picture_monthly_price_special.end_date )
GROUP BY store_picture_monthly.reference , store_picture_monthly_price_regular.price , store_picture_monthly_price_regular.effective_date , store_picture_monthly_price_special.price
ORDER BY store_picture_monthly_price_regular.effective_date DESC
Table "store_picture_monthly"
reference bigint,
name text,
description text,
tenure bigint,
available_date timestamp with time zone,
available_membership_reference bigint
Table store_picture_monthly_price_regular
reference bigint ,
store_picture_monthly_reference bigint,
effective_date timestamp with time zone,
price numeric(10,2),
membership_reference bigint
Table store_picture_monthly_price_special
reference bigint,
store_picture_monthly_reference bigint,
begin_date timestamp with time zone,
end_date timestamp with time zone,
price numeric(10,2),
created_date timestamp with time zone DEFAULT now(),
membership_reference bigint

The description of the problem suggests that you want a LEFT JOIN, not a FULL JOIN. FULL JOINs are quite rare, particularly in databases with well defined foreign key relationships.
In your case, the WHERE clause is turning your FULL JOIN into a LEFT JOIN anyway, because the WHERE clause requires valid values from the first table.
SELECT spm.reference AS store_picture_monthly_reference,
spm.tenure, spm.name,
spmpr.price AS regular_price,
spmps.price AS special_price
FROM store_picture_monthly spm INNER JOIN
store_picture_monthly_price_regular spmpr
ON spmpr.store_picture_monthly_reference = spm.reference LEFT JOIN
store_picture_monthly_price_special spmps
ON spm.reference = spmps.store_picture_monthly_reference AND
NOW() BETWEEN spmps.begin_date AND spmps.end_date
WHERE spmpr.effective_date < NOW();
Notes:
I introduced table aliases so the query is easier to write and to read.
The condition on the dates for the sale is now in the ON clause.
I removed the GROUP BY. It doesn't seem necessary. If it is, you can use SELECT DISTINCT instead, and I would investigate data problems if this is needed.
I am suspicious about the date comparisons. NOW() has a time component. The naming of the comparison columns suggests that they are just dates with no time.
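The effect of moving the sale-date condition from WHERE to ON can be reproduced with a cut-down version of the schema. This is only a sketch using Python's built-in sqlite3 module: the tables keep just the columns the query needs, the rows are invented, and the `now` parameter replaces NOW() so the result is reproducible. The function name `current_prices` and its `sale_filter_in_where` flag are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_picture_monthly (reference INTEGER, name TEXT);
    CREATE TABLE store_picture_monthly_price_regular (
        store_picture_monthly_reference INTEGER, effective_date TEXT, price REAL);
    CREATE TABLE store_picture_monthly_price_special (
        store_picture_monthly_reference INTEGER,
        begin_date TEXT, end_date TEXT, price REAL);
    INSERT INTO store_picture_monthly VALUES (1, '1 Month'), (2, '3 Months');
    INSERT INTO store_picture_monthly_price_regular VALUES
        (1, '2024-01-01', 29.99), (2, '2024-01-01', 79.99);
    -- only product 1 has a sale, and only during June 2024
    INSERT INTO store_picture_monthly_price_special VALUES
        (1, '2024-06-01', '2024-06-30', 19.99);
""")


def current_prices(now, sale_filter_in_where=False):
    """Return (reference, regular_price, special_price) rows as of `now`."""
    sale_window = "? BETWEEN spmps.begin_date AND spmps.end_date"
    on_extra = "" if sale_filter_in_where else f"AND {sale_window}"
    where_extra = f"AND {sale_window}" if sale_filter_in_where else ""
    sql = f"""
        SELECT spm.reference, spmpr.price, spmps.price
        FROM store_picture_monthly spm
        INNER JOIN store_picture_monthly_price_regular spmpr
          ON spmpr.store_picture_monthly_reference = spm.reference
        LEFT JOIN store_picture_monthly_price_special spmps
          ON spm.reference = spmps.store_picture_monthly_reference {on_extra}
        WHERE spmpr.effective_date < ? {where_extra}
        ORDER BY spm.reference
    """
    # Both placeholders receive the same timestamp in either variant.
    return conn.execute(sql, (now, now)).fetchall()


print(current_prices("2024-06-15"))                             # sale running
print(current_prices("2024-07-15"))                             # no sale
print(current_prices("2024-07-15", sale_filter_in_where=True))  # original shape
```

With the condition in ON, every product keeps its regular price and the special price is simply null outside the sale window; with it in WHERE, the same data returns no rows at all once the sale has ended.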

Any time you put an ordinary comparison predicate (such as = or BETWEEN) in the WHERE clause on a column of the outer-joined table, it effectively converts the outer join into an inner join, because the NULLs introduced by the outer join can never compare as true to anything. The outer join adds rows with NULLs where there is no match, and then the WHERE clause takes those entire rows out again.
Consider this simpler example:
SELECT * FROM
a LEFT JOIN b ON a.id = b.id
WHERE b.col = 'value'
is equivalent to:
SELECT * FROM
a INNER JOIN b ON a.id = b.id
WHERE b.col = 'value'
To resolve this, move the predicate out of the WHERE and into the ON:
SELECT * FROM
a LEFT JOIN b ON a.id = b.id AND b.col = 'value'
You can also consider:
SELECT * FROM
a LEFT JOIN b ON a.id = b.id
WHERE b.col = 'value' OR b.col IS NULL
but this might pick up data you don't want if b.col naturally contains some NULLs: it cannot differentiate between NULLs that are natively present in b.col and NULLs introduced because the join failed to match a row from b with a row from a (unless we also look at the nullness of the joined id column).
A
id
1
2
3
B
id, col
1, value
3, null
--wrong, behaves like inner
A left join B ON a.id=b.id WHERE b.col = 'value'
1, 1, value
--maybe wrong, b.id 3 might be unwanted
A left join B ON a.id=b.id WHERE b.col = 'value' or b.col is null
1, 1, value
2, null, null
3, 3, null
--maybe right, simpler to maintain than the above
A left join B ON a.id=b.id AND b.col = 'value'
1, 1, value
2, null, null
3, null, null
In these last two, the row count is the same; the difference is whether b.id is null. If we were counting b.id, our count could end up wrong. It's important to appreciate this nuance of join behavior. You might even want it: to exclude row 3 but include row 2, craft the query as a LEFT JOIN b ON a.id=b.id WHERE b.col = 'value' OR b.id IS NULL. This keeps row 2 but excludes row 3, because even though the join finds a b.id of 3, that row satisfies neither predicate.
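The four variants discussed above can be run side by side. The following sketch uses Python's built-in sqlite3 module with the same A and B rows as the worked example; the helper name `run` is just an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER, col TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 'value'), (3, NULL);
""")


def run(join_suffix):
    """Append either extra ON terms or a WHERE clause to the same left join."""
    sql = f"""
        SELECT a.id, b.id, b.col
        FROM a LEFT JOIN b ON a.id = b.id {join_suffix}
        ORDER BY a.id
    """
    return conn.execute(sql).fetchall()


where_only = run("WHERE b.col = 'value'")                         # behaves like inner
where_or_col_null = run("WHERE b.col = 'value' OR b.col IS NULL")  # keeps b.id 3
in_on_clause = run("AND b.col = 'value'")                         # predicate in ON
where_or_id_null = run("WHERE b.col = 'value' OR b.id IS NULL")    # drops row 3

for name, rows in [("where_only", where_only),
                   ("where_or_col_null", where_or_col_null),
                   ("in_on_clause", in_on_clause),
                   ("where_or_id_null", where_or_id_null)]:
    print(name, rows)
```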

How to ignore rows with specific IDs in an SQL query (PHP)

I have a simple shop built with PHP, and I need to ignore some products on the shop's manage page. How can I exclude them in the SQL query?
Here is my query:
$query = "SELECT a.*,
a.user as puser,
a.id as pid,
b.date as date,
b.price as price,
b.job_id as job_id,
b.masterkey as masterkey
FROM table_shop a
INNER JOIN table_shop_s b ON a.id = b.buyid
WHERE b.payok = 1
ORDER BY buyid";
I need this query to ignore rows with product_id = "3" or "4" from table table_shop_s.

WHERE b.payok = 1 AND b.product_id != 3 AND b.product_id != 4

Simply use NOT IN (to ignore specific product IDs), combined with the existing condition using AND. Since product_id lives in table_shop_s, the filter goes on alias b. Use the following:
$query = "SELECT a.*,
a.user as puser,
a.id as pid,
b.date as date,
b.price as price,
b.job_id as job_id,
b.masterkey as masterkey
FROM table_shop a
INNER JOIN table_shop_s b ON a.id = b.buyid
WHERE b.payok = 1
AND b.product_id NOT IN (3,4)
ORDER BY buyid";

Another answer suggests "productid NOT IN (3,4)", which would work, but only as a short-term fix. Extend the thinking a bit: it is two products now, but what if you have more to hide in the future? Would you change all your queries and risk missing one?
My suggestion would be to update your product table. Add a column such as ExcludeFlag set to 1 or 0 (1 = yes, exclude; 0 = OK, leave it alone). Then join your shop detail table to products and exclude rows when this flag is set. Also, you only need "AS" on a column when you are changing its result column name. And since a.* already returns all columns from the table aliased "a", do you really need the extra "a.user as puser, a.id as pid"?
Something like:
SELECT
a.*,
b.date,
b.price,
b.job_id,
b.masterkey
FROM
table_shop a
INNER JOIN table_shop_s b
ON a.id = b.buyid
AND b.payok = 1
INNER JOIN YourProductTable ypt
on b.ProductID = ypt.ProductID
AND ypt.ExcludeFlag = 0
ORDER BY
a.id
Notice the extra join, which specifically includes only the rows where the flag is NOT set.
It is also good practice to alias table names closer to the context of their purpose rather than just "a" and "b", much like my example of the long table name YourProductTable aliased as ypt.
I also changed the ORDER BY to a.id, since that is the primary table in your query and, because a.id = b.buyid, it is the same key order anyhow and is probably indexed on your "a" table too. I would assume table_shop_s already has an index on (buyid), but once you have a lot of records, an index on (buyid, payok) might perform better because it matches both parts of your join criteria.
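A small sketch of the flag-based approach, using Python's built-in sqlite3 module. The table and column names `products`, `product_id` and `exclude_flag`, the sample rows, and the helper `visible_purchases` are all assumptions made for illustration, since the real product table is not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_shop (id INTEGER PRIMARY KEY, user TEXT);
    CREATE TABLE table_shop_s (buyid INTEGER, product_id INTEGER, payok INTEGER);
    -- hypothetical product table carrying the exclusion flag
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        exclude_flag INTEGER NOT NULL DEFAULT 0);
    INSERT INTO table_shop VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO table_shop_s VALUES (1, 3, 1), (2, 5, 1), (3, 4, 1);
    INSERT INTO products VALUES (3, 1), (4, 1), (5, 0);
""")


def visible_purchases():
    """Paid purchases whose product is not flagged for exclusion."""
    return conn.execute("""
        SELECT a.id, a.user, b.product_id
        FROM table_shop a
        INNER JOIN table_shop_s b
            ON a.id = b.buyid AND b.payok = 1
        INNER JOIN products p
            ON b.product_id = p.product_id AND p.exclude_flag = 0
        ORDER BY a.id
    """).fetchall()


before = visible_purchases()
# Un-hiding product 4 is a data change only; the query stays untouched.
conn.execute("UPDATE products SET exclude_flag = 0 WHERE product_id = 4")
after = visible_purchases()
print(before)
print(after)
```

The point of the design is visible in the last three lines: hiding or showing a product becomes an UPDATE on one row rather than an edit to every query that lists purchases.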

MS SQL - Problem selecting a subset of records

I'm having a SQL brainfart moment. I am trying to get a set of records when any of the attribute IDs for that product is a certain value.
Problem is, I need to get all other attributes for that same product along with it.
Here's an illustration for what I mean:
Is there a way to do that? Currently I am doing this
select product_id
from mytable
where product_attribute_id = 154
But I obviously only get the single record:
Any help would be greatly appreciated. My SQL skills are a bit basic.
EDIT
There's one condition I forgot to mention. There are times where I need to be able to filter on two attribute IDs. For example, in the first image above, the lower set (product ID 31039) has attribute id 395. I would need to filter on 154, 395. The result would not include the top set (31046) which does not have an attribute id 395.

I think this is what you're looking for:
SELECT * FROM MyTable WHERE Product_Id IN (SELECT Product_Id FROM MyTable WHERE Product_Attribute_Id = @parameterValue)
In English: get me all the records whose product ID is in the set of product IDs that have an attribute ID equal to @parameterValue.
EDIT:
SELECT * FROM MyTable WHERE Product_Id IN (SELECT Product_Id FROM MyTable WHERE Product_Attribute_Id = @parameterValue1) AND Product_Id IN (SELECT Product_Id FROM MyTable WHERE Product_Attribute_Id = @parameterValue2)
That should do it.

Using proper joins, you can link back to the same table
select B.*
from mytable A
-- retrieve B records from A record link
inner join mytable B on B.product_id = A.product_id
where A.product_attribute_id = 154 -- all the A records
EDIT: to get products that have both attributes, you can join one more time:
select C.*
from mytable A
-- B: the same product must also have the second attribute
inner join mytable B on B.product_id = A.product_id
-- C: all attribute records of that product
inner join mytable C on C.product_id = A.product_id
where A.product_attribute_id = 154 -- has attrib 1
AND B.product_attribute_id = 395 -- has attrib 2
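If the number of attribute IDs to filter on varies, one IN subquery or self-join per attribute gets unwieldy. A possible generalization, which is not from either answer, is to group by product and require that all requested attributes are present. The sketch below uses Python's built-in sqlite3 module; the attribute IDs 200 and 201 and the helper `products_with_all` are made up for illustration, and the approach assumes the IDs passed in are distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (product_id INTEGER, product_attribute_id INTEGER);
    INSERT INTO mytable VALUES
        (31046, 154), (31046, 200),
        (31039, 154), (31039, 395), (31039, 201);
""")


def products_with_all(attr_ids):
    """All attribute rows of products that have every ID in attr_ids."""
    placeholders = ", ".join("?" for _ in attr_ids)
    sql = f"""
        SELECT product_id, product_attribute_id
        FROM mytable
        WHERE product_id IN (
            SELECT product_id
            FROM mytable
            WHERE product_attribute_id IN ({placeholders})
            GROUP BY product_id
            -- the product must match every requested attribute
            HAVING COUNT(DISTINCT product_attribute_id) = ?
        )
        ORDER BY product_id, product_attribute_id
    """
    return conn.execute(sql, (*attr_ids, len(attr_ids))).fetchall()


print(products_with_all([154]))        # both products, all their attributes
print(products_with_all([154, 395]))   # only product 31039
```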
