Postgres JOIN on multiple possible columns with OR statement - sql

I have two tables that I want to join together:
contracts:

id | customer_id_1 | customer_id_2 | customer_id_3 | date
1  | MAIN1         | TRAN1         | TRAN2         | 20201101
2  | MAIN2         |               |               | 20201001
3  | MAIN3         | TRAN5         |               | 20200901
4  | MAIN4         | TRAN7         | TRAN8         | 20200801
customers:

id | customer_id | info | date
1  | MAIN1       | blah | 20200930
2  | TRAN2       | blah | 20200929
3  | TRAN5       | blah | 20200831
4  | TRAN7       | blah | 20200801
In my contracts table, each row represents a contract with a customer, who may be referred to by one or more different IDs in the customers table. In the customers table, I have info on customers (there can be zero or multiple records, on different dates, for each customer). I want to join contracts onto customers so that I get the most recent info available on a customer at the time the contract was recorded, ignoring any customer info dated after the contract date. I am also not interested in contracts that have no info on their customers. The main complication is that in customers, each customer record can reference any one of the 3 IDs that may exist.
I currently have the following query, which performs the task as intended; the problem is that it is extremely slow when run on data in the 50-100k row range. If I remove the OR conditions in the INNER JOIN and just join on the first ID, the query completes in seconds as opposed to ~half an hour.
SELECT
DISTINCT ON (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date AS contract_date,
cst.info,
cst.date AS info_date
FROM
contracts ctr
INNER JOIN customers cst ON (
cst.customer_id = ctr.customer_id_1
OR cst.customer_id = ctr.customer_id_2
OR cst.customer_id = ctr.customer_id_3
)
AND ctr.date >= cst.date
ORDER BY
ctr.id,
cst.date DESC
Result:

id | customer_id_1 | contract_date | info | info_date
1  | MAIN1         | 20201101      | blah | 20200930
3  | MAIN3         | 20200901      | blah | 20200831
4  | MAIN4         | 20200801      | blah | 20200801
It seems like OR conditions in JOINs aren't very common (I've barely found any examples online), and I presume this is because there must be a better way of doing this. So my question is: how can this be optimised?

OR is often a performance killer in SQL predicates.
One alternative unpivots before joining:
select distinct on (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date as contract_date,
cst.info,
cst.date as info_date
from contracts ctr
cross join lateral (values
(ctr.customer_id_1), (ctr.customer_id_2), (ctr.customer_id_3)
) as ctx(customer_id)
inner join customers cst on cst.customer_id = ctx.customer_id and ctr.date >= cst.date
order by ctr.id, cst.date desc
The need for this technique points to a flaw in your data model: the relation between contracts and customers should be stored in a separate table, with each contract/customer pair on its own row. Essentially, the lateral join builds that derived table on the fly.
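For engines without LATERAL (or to test the idea locally), the same unpivot can be written with a UNION ALL derived table, and DISTINCT ON can be emulated with ROW_NUMBER(). A minimal sketch using Python's sqlite3 as a stand-in database; the placement of NULLs for the missing contract IDs is an assumption from the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contracts (id INT, customer_id_1 TEXT, customer_id_2 TEXT,
                        customer_id_3 TEXT, date TEXT);
CREATE TABLE customers (id INT, customer_id TEXT, info TEXT, date TEXT);
INSERT INTO contracts VALUES
  (1,'MAIN1','TRAN1','TRAN2','20201101'),
  (2,'MAIN2',NULL,NULL,'20201001'),
  (3,'MAIN3','TRAN5',NULL,'20200901'),
  (4,'MAIN4','TRAN7','TRAN8','20200801');
INSERT INTO customers VALUES
  (1,'MAIN1','blah','20200930'),
  (2,'TRAN2','blah','20200929'),
  (3,'TRAN5','blah','20200831'),
  (4,'TRAN7','blah','20200801');
""")

# Unpivot the three ID columns with UNION ALL, then keep the newest
# qualifying customer row per contract via ROW_NUMBER() -- SQLite has
# no DISTINCT ON or LATERAL, so this emulates both.
rows = conn.execute("""
WITH ids AS (
  SELECT id, customer_id_1 AS cid FROM contracts
  UNION ALL SELECT id, customer_id_2 FROM contracts
  UNION ALL SELECT id, customer_id_3 FROM contracts
),
ranked AS (
  SELECT ctr.id, ctr.customer_id_1, ctr.date AS contract_date,
         cst.info, cst.date AS info_date,
         ROW_NUMBER() OVER (PARTITION BY ctr.id ORDER BY cst.date DESC) AS rn
  FROM contracts ctr
  JOIN ids ON ids.id = ctr.id
  JOIN customers cst ON cst.customer_id = ids.cid AND ctr.date >= cst.date
)
SELECT id, customer_id_1, contract_date, info, info_date
FROM ranked WHERE rn = 1 ORDER BY id
""").fetchall()
print(rows)
```

This reproduces the three result rows from the question; contract 2 drops out because none of its IDs match a customer record.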

COUNT with multiple LEFT joins

I am having some trouble with a count function. The problem is caused by a LEFT JOIN that I am not sure I am doing correctly.
Variables are:
Customer_name (buyer)
Product_code (what the customer buys)
Store (where the customer buys)
The datasets are:
Customer_df (list of customers and product codes of their purchases)
Store1_df (list of product codes per week, for Store 1)
Store2_df (list of product codes per day, for Store 2)
Final output desired:
I would like to have a table with:
col1: Customer_name;
col2: Count of items purchased in store 1;
col3: Count of items purchased in store 2;
Filters: date range
My query looks like this:
SELECT
DISTINCT
C.customer_name,
C.product_code,
COUNT(S1.product_code) AS s1_sales,
COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)
GROUP BY
customer_name, product_code
HAVING
S1_sales > 0
OR S2_sales > 0
The output I expect is something like this:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales
Luigi         | 120012       | 4                   | 8
James         | 100022       | 6                   | 10
But instead, I get:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales
Luigi         | 120012       | 290                 | 60
James         | 100022       | 290                 | 60
It works when, instead of COUNT(product_code), I do COUNT(DISTINCT product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I do a distinct count over more than one week of data, I will not get the right numbers).
My hypothesis are:
I am joining the tables in the wrong way
There is a problem when joining two datasets with different time aggregations
What am I doing wrong?
The reason, as Philipxy indicated, is a common one: you are getting a Cartesian result from your joins, which bloats your numbers. To simplify, let's consider just a single customer purchasing one item from two stores. The first store has 3 purchases, the second store has 5. The joined result has 3 * 5 = 15 rows, because each of the 3 rows from the first store is paired with each of the 5 matching rows from the second store; counting over that join bloats both counts. So, by having each store pre-aggregate per customer before the join, each side contributes at most one record per customer per store (and per product, as in your desired outcome).
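The multiplication is easy to reproduce. A toy sketch with Python's sqlite3 (single product code, store tables reduced to one column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store1 (product_code INT);
CREATE TABLE store2 (product_code INT);
INSERT INTO store1 VALUES (1),(1),(1);          -- 3 purchases
INSERT INTO store2 VALUES (1),(1),(1),(1),(1);  -- 5 purchases
""")

# Joining the stores on product_code pairs every store1 row with every
# store2 row, so COUNT sees 3 * 5 = 15 rows instead of 3 and 5.
(c1, c2), = conn.execute("""
SELECT COUNT(s1.rowid), COUNT(s2.rowid)
FROM store1 s1 JOIN store2 s2 USING (product_code)
""")
print(c1, c2)  # 15 15
```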
select
c.customer_name,
AllCustProducts.Product_Code,
coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
from
customer_df c
-- now, we need all possible UNIQUE instances of
-- a given customer and product to prevent duplicates
-- for subsequent queries of sales per customer and store
JOIN
( select distinct customerid, product_code
from store1_df
union
select distinct customerid, product_code
from store2_df ) AllCustProducts
on c.customerid = AllCustProducts.customerid
-- NOW, we can join to a pre-query of sales at store 1
-- by customer id and product code. You may also want to
-- get sum( SalesDollars ) if available, just add respectively
-- to each sub-query below.
LEFT JOIN
( select
s1.customerid,
s1.product_code,
count(*) as SalesEntries
from
store1_df s1
group by
s1.customerid,
s1.product_code ) PQStore1
on AllCustProducts.customerid = PQStore1.customerid
AND AllCustProducts.product_code = PQStore1.product_code
-- now, same pre-aggregation to store 2
LEFT JOIN
( select
s2.customerid,
s2.product_code,
count(*) as SalesEntries
from
store2_df s2
group by
s2.customerid,
s2.product_code ) PQStore2
on AllCustProducts.customerid = PQStore2.customerid
AND AllCustProducts.product_code = PQStore2.product_code
No need for a GROUP BY or HAVING, since every entry in the respective pre-aggregates yields at most one record per unique combination. As for your need to filter by date ranges: just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 subqueries.

Access SQL: How to retrieve sums of multiple values where user IDs are assigned to multiple positions

I'm working on an Access database for assigning tasks to personnel and tracking task status and workload. A single user ID can be assigned to one of many fields associated with a particular task. In this case, the Task table has fields for "TechReviewerID" "DesignerID" "TechReviewerWorkload" and "DesignerWorkload."
I want one query to return one row for each person, with two summary columns totaling all of the workload assigned to them. So if I'm ID1, I want column 3 to return the sum of "TechReviewerWorkload" in all tasks where "TechReviewerID = 1" and column 4 to return the sum of "DesignerWorkload" in all tasks where "DesignerID = 1."
I have successfully written two separate queries that accomplish this:
SELECT MESPersonnel.MESID, MESPersonnel.PersonnelName,
IIF(SUM(DesignerTask.DesignerWorkload) IS NULL, 0, SUM(DesignerTask.DesignerWorkload)) AS
TotalDesignerWorkload
FROM
(MESPersonnel LEFT OUTER JOIN Task AS DesignerTask ON (MESPersonnel.MESID =
DesignerTask.DesignerID
AND DesignerTask.DueDate < CDATE('2020-07-30') AND DesignerTask.DueDate > CDATE ('2020-05-01')))
WHERE MESPersonnel.PositionID = 1
GROUP BY MESPersonnel.MESID, MESPersonnel.PersonnelName;
This query gives the following table:
MESID PersonnelName TotalDesignerWorkload
1 John Doe 40
2 Dohn Joe 20
I can create a near-identical query by replacing all instances of "designer" terms with "tech reviewer" terms.
What I'm looking for is a table like:
MESID PersonnelName TotalDesignerWorkload TotalReviewerWorkload
1 John Doe 40 10
2 Dohn Joe 20 20
My attempts to combine these two via multiple outer joins resulted in wildly inaccurate sums. I know how to solve that for items on different tables, but I'm not sure how to resolve it when I'm using two items from the same table. Is there some kind of conditional sum I can use in my query that Access supports?
EDIT: Sample Raw Data
Task Table
TaskID DesignerID TechReviewerID DesignerWorkload TechReviewerWorkload DueDate
1 1 2 40 20 06-20-2020
2 2 1 20 10 06-20-2020
MESPersonnel Table
MESID PersonnelName
1 John Doe
2 Dohn Joe
Consider:
Query1, saved as TaskUNION, rearranges the data into a normalized structure:
SELECT TaskID, DesignerID AS UID, PersonnelName, DesignerWorkload AS Data, DueDate, "Design" AS Cat FROM MESPersonnel
INNER JOIN Task ON MESPersonnel.MESID = Task.DesignerID
UNION SELECT TaskID, TechReviewerID, PersonnelName, TechReviewerWorkload, DueDate, "Tech" FROM MESPersonnel
INNER JOIN Task ON MESPersonnel.MESID = Task.TechReviewerID;
Query2:
TRANSFORM Sum(Data) AS SumData
SELECT UID, PersonnelName
FROM TaskUNION
WHERE DueDate BETWEEN #5/1/2020# AND #7/31/2020#
GROUP BY UID, PersonnelName
PIVOT Cat;
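TRANSFORM/PIVOT is specific to Access, but the same union-then-pivot pattern can be expressed portably with conditional aggregation. A sketch using Python's sqlite3 as a stand-in (UNION ALL is used because TaskID is dropped here; table contents are the sample data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Task (TaskID INT, DesignerID INT, TechReviewerID INT,
                   DesignerWorkload INT, TechReviewerWorkload INT, DueDate TEXT);
CREATE TABLE MESPersonnel (MESID INT, PersonnelName TEXT);
INSERT INTO Task VALUES (1,1,2,40,20,'2020-06-20'),
                        (2,2,1,20,10,'2020-06-20');
INSERT INTO MESPersonnel VALUES (1,'John Doe'),(2,'Dohn Joe');
""")

# Normalize both role columns into one (UID, Data, Cat) shape, then
# pivot with SUM(CASE ...) instead of Access's TRANSFORM ... PIVOT.
rows = conn.execute("""
WITH TaskUNION AS (
  SELECT DesignerID AS UID, PersonnelName, DesignerWorkload AS Data,
         DueDate, 'Design' AS Cat
  FROM MESPersonnel JOIN Task ON MESID = DesignerID
  UNION ALL
  SELECT TechReviewerID, PersonnelName, TechReviewerWorkload, DueDate, 'Tech'
  FROM MESPersonnel JOIN Task ON MESID = TechReviewerID
)
SELECT UID, PersonnelName,
       SUM(CASE WHEN Cat = 'Design' THEN Data ELSE 0 END) AS TotalDesignerWorkload,
       SUM(CASE WHEN Cat = 'Tech'   THEN Data ELSE 0 END) AS TotalReviewerWorkload
FROM TaskUNION
WHERE DueDate BETWEEN '2020-05-01' AND '2020-07-31'
GROUP BY UID, PersonnelName
ORDER BY UID
""").fetchall()
print(rows)
```

This matches the desired table: John Doe 40/10, Dohn Joe 20/20.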
An alternative would involve 2 simple, filtered aggregate queries on the Task table, then joining those 2 queries to MESPersonnel. Here it is as an all-in-one statement:
SELECT MESID, PersonnelName, SumOfDesignerWorkload, SumOfTechReviewerWorkload
FROM (
SELECT DesignerID, Sum(DesignerWorkload) AS SumOfDesignerWorkload
FROM Task WHERE DueDate BETWEEN #5/1/2020# AND #7/31/2020# GROUP BY DesignerID) AS SumDesi
RIGHT JOIN ((
SELECT TechReviewerID, Sum(TechReviewerWorkload) AS SumOfTechReviewerWorkload
FROM Task WHERE DueDate BETWEEN #5/1/2020# AND #7/31/2020# GROUP BY TechReviewerID) AS SumTech
RIGHT JOIN MESPersonnel ON SumTech.TechReviewerID = MESPersonnel.MESID)
ON SumDesi.DesignerID = MESPersonnel.MESID;

Determine records which held particular "state" on a given date

I have a state machine architecture, where a record will have many state transitions, the one with the greatest sort_key column being the current state. My problem is to determine which records held a particular state (or states) for a given date.
Example data:
items table
id
1
item_transitions table
id item_id created_at to_state sort_key
1 1 05/10 "state_a" 1
2 1 05/12 "state_b" 2
3 1 05/15 "state_a" 3
4 1 05/16 "state_b" 4
Problem:
Determine all records from items table which held state "state_a" on date 05/15. This should obviously return the item in the example data, but if you query with date "05/16", it should not.
I presume I'll be using a LEFT OUTER JOIN to join the items_transitions table to itself and narrow down the possibilities until I have something to query on that will give me the items that I need. Perhaps I am overlooking something much simpler.
Rephrased, your question means: "give me all items which changed to state_a on 05/15 or before and have not changed to another state since". Note that for the example I added 2001 as the year to get valid dates. If your created_at column is not a datetime type, I strongly suggest changing it.
So first you can retrieve the last sort_key for all items before the threshold date:
SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id
Next step is to join this result back to the item_transitions table to see to which state the item was switched at this specific sort_key:
SELECT *
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
Finally you only want those who switched to 'state_a' so just add a condition:
SELECT DISTINCT it.item_id
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
WHERE it.to_state='state_a'
You did not mention which DBMS you use, but I think this query should work with most common ones.
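As a check, here is the final query run against the example data using Python's sqlite3 (ISO dates with the assumed year 2001; the date threshold is parameterized):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_transitions (id INT, item_id INT, created_at TEXT,
                               to_state TEXT, sort_key INT);
INSERT INTO item_transitions VALUES
  (1, 1, '2001-05-10', 'state_a', 1),
  (2, 1, '2001-05-12', 'state_b', 2),
  (3, 1, '2001-05-15', 'state_a', 3),
  (4, 1, '2001-05-16', 'state_b', 4);
""")

def items_in_state(state, on_date):
    """Items whose latest transition on or before on_date is `state`."""
    return [r[0] for r in conn.execute("""
        SELECT DISTINCT it.item_id
        FROM item_transitions it
        JOIN (SELECT item_id, MAX(sort_key) AS last_change_sort_key
              FROM item_transitions
              WHERE created_at <= ?
              GROUP BY item_id) tmp
          ON it.item_id = tmp.item_id
         AND it.sort_key = tmp.last_change_sort_key
        WHERE it.to_state = ?
    """, (on_date, state))]

print(items_in_state('state_a', '2001-05-15'))  # [1]
print(items_in_state('state_a', '2001-05-16'))  # []
```

As required, the item is returned for 05/15 but not for 05/16, where the latest transition is to state_b.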

2 Queries same logic but different no. of output rows

I have 2 query which both aims to select all batchNo that follows 3 conditions:
ClaimStatus must be 95 or 90
CreatedBy = ProviderLink
The minimum dateUpdate should be from 3pm yesterday until when this query was run
Query 1: Outputs 940 rows
SELECT
DISTINCT bh.BatchNo,
bh.Coverage,
DateUploaded = MIN(csa.DateUpdated)
FROM
Registration2..BatchHeader bh with(nolock)
INNER JOIN ClaimsProcess..BatchHeader bhc with(nolock) on bhc.BatchNo = bh.BatchNo
INNER JOIN ClaimsInfo ci with(nolock) on ci.BatchNo = bhc.BatchNo
INNER JOIN Claims c with(nolock) on c.ClaimNo = ci.ClaimNo
INNER JOIN ClaimStatusAudit csa WITH(NOLOCK) on csa.CLAIMNO = ci.ClaimNo
WHERE c.ClaimStatus in('95','90') AND bhc.CreatedBy = 'PROVIDERLINK'
GROUP BY bh.BatchNo, bh.Coverage
HAVING MIN(CSA.DateUpdated) >= convert(varchar(10),GETDATE() -1,110) + ' 15:00:00.000'
Query 2: Outputs 1314 rows
SELECT
DISTINCT bh.BatchNo,
bh.Coverage
FROM Registration2..BatchHeader bh with(nolock)
INNER JOIN ClaimsProcess..BatchHeader bhc with(nolock) on bhc.BatchNo = bh.BatchNo
INNER JOIN ClaimsInfo ci with(nolock) on ci.BatchNo = bhc.BatchNo
INNER JOIN Claims c with(nolock) on c.ClaimNo = ci.ClaimNo
WHERE c.ClaimStatus in('95','90') AND bhc.CreatedBy = 'PROVIDERLINK'
AND (SELECT MIN(DATEUPDATED) FROM CLAIMSTATUSAUDIT WITH(NOLOCK) WHERE CLAIMNO = ci.ClaimNo) >= convert(varchar(10),GETDATE() -1,110) + ' 15:00:00.000'
Though both use the same logic, they output different numbers of rows. I would like to know which of the two is more accurate.
BTW, both outputs follow the 3 given conditions.
Your assumption is wrong. These two queries are not employing the same logic, simply because of the order in which each clause is evaluated. Clauses are evaluated in the following order (see here for the full article):
From
Where
Group By
Having
Select
Order By
With that detail out of the way, let's analyze why these two queries return a different number of rows.
You're returning different numbers of rows because of *when* each query applies the filter on DateUpdated (the 3pm-yesterday cutoff).
In Query 1, you're selecting all Batch Numbers and Coverages that meet two conditions:
1. have corresponding records in all joined tables
2. have the desired claim status and were created by "ProviderLink"
You get this list of records once the From, Where, and Group by clauses have been executed.
You then run the aggregate calculation (MIN) on that set of data, pulling the minimum DateUpdated, without having put any restriction on DateUpdated yet. So when you then group your data and filter the groups using the HAVING clause, you filter out every group that meets criteria 1 and 2 above but whose minimum DateUpdated falls before 3pm yesterday. Let's look at an example.
Record 1 has a BatchNo 123 and Coverage A and was last updated on 4/4/2014 12:00:00.000
Record 2 has a BatchNo 123 and Coverage A and was last updated today at 5/1/2014 3:01:00.000
Assuming Records 1 & 2 have corresponding records in all joined tables, Query 1 will pull back the distinct BatchNo and Coverage (123 & A, respectively) and find the minimum DateUpdated, which is 4/4/2014 12:00:00.000. Once grouped, the HAVING clause sees that this minimum is not after 3pm yesterday, so it filters the group out.
Query 2, on the other hand, takes a different approach. It sees Records 1 and 2 as the same in terms of BatchNo & Coverage because those values are identical. However, in the WHERE clause (i.e., the initial filtering process), it only looks for records where the claim's own minimum DateUpdated is after 3pm yesterday, so it finds Record 2 and returns it in the dataset.
I think you will find that is the case with the 374 missing records from Dataset 1.
All that said, and with the understanding that we cannot tell you which dataset is better: Query 1 only shows groups of distinct BatchNos & Coverages where the minimum DateUpdated among all records in the group falls after 3pm yesterday, meaning it returns only BatchNos and Coverages that contain nothing but very new records.
Query 2 returns any distinct BatchNo & Coverage grouping where any claim within the group was first updated after 3pm yesterday. So which one is right for you?
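The divergence is easy to reproduce on a toy schema. A sketch with Python's sqlite3 (simplified to a single audit table; names and the two-claim batch are assumptions, mirroring the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit (batch_no INT, claim_no INT, date_updated TEXT);
-- batch 123 has two claims: one old, one first updated after the cutoff
INSERT INTO audit VALUES
  (123, 1, '2014-04-04 12:00'),
  (123, 2, '2014-05-01 15:01');
""")
cutoff = '2014-04-30 15:00'

# Query-1 style: group first, then test the group-wide minimum in HAVING.
q1 = conn.execute("""
SELECT batch_no FROM audit
GROUP BY batch_no
HAVING MIN(date_updated) >= ?
""", (cutoff,)).fetchall()

# Query-2 style: filter each claim by its own minimum before grouping.
q2 = conn.execute("""
SELECT DISTINCT batch_no FROM audit a
WHERE (SELECT MIN(date_updated) FROM audit
       WHERE claim_no = a.claim_no) >= ?
""", (cutoff,)).fetchall()

print(q1, q2)  # q1 drops batch 123; q2 keeps it
```

The group-wide minimum (Query 1) discards batch 123 because of the old claim, while the per-claim filter (Query 2) still returns it, which is exactly why the second query yields more rows.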

SQL view using double joins between tables

I'm trying to create a view using a double join between tables.
I'm working on some travel software, managing holiday bookings. The different items a person pays for can be in different currencies.
I've a table of bookings, and a table of currencies.
There are many different items a person can pay for, all stored in different tables. I've created a view showing the total owed per payment Item type.
e.g. owed for Transfers:
BookingID CurrencyID TotalTransfersPrice
1 1 340.00
2 1 120.00
2 2 100.00
e.g. owed for Extras:
BookingID CurrencyID TotalExtrasPrice
1 1 200.00
1 2 440.00
2 1 310.00
All is good so far.
What I'd like to do is to create a master view that brings this all together:
BookingID CurrencyID TotalExtrasPrice TotalTransfersPrice
1 1 200.00 340.00
1 2 440.00 NULL
2 1 310.00 120.00
2 2 NULL 100.00
I can't figure out how to make the above. I've been experimenting with double joins, as I'm guessing I need to do joins both for the BookingID and the CurrencyID?
Any ideas?
Thanks!
Phil.
For SQL Server
This query allows each {BookingId, CurrencyId} pair to have more than one row in the Transfers and Extras tables.
Since you stated "I've created a view showing the total owed per payment Item type", I'm accumulating them by BookingID and CurrencyID:
SELECT ISNULL(t.BookingId, e.BookingId) AS BookingId,
       ISNULL(t.CurrencyId, e.CurrencyId) AS CurrencyId,
       SUM(e.TotalExtrasPrice) AS TotalExtrasPrice,
       SUM(t.TotalTransfersPrice) AS TotalTransfersPrice
FROM transfers t
FULL OUTER JOIN extras e ON t.BookingId = e.BookingId AND t.CurrencyId = e.CurrencyId
GROUP BY ISNULL(t.BookingId, e.BookingId), ISNULL(t.CurrencyId, e.CurrencyId)
You should try a FULL OUTER JOIN to combine the two tables, Transfers & Extras. Note that MySQL does not support FULL OUTER JOIN; on a platform that does (e.g. SQL Server, PostgreSQL, Oracle), the SQL query can be:
SELECT t.BookingId,t.CurrencyId,e.TotalExtrasPrice,t.TotalTransfersPrice
FROM transfers as t FULL OUTER JOIN extras as e
ON t.BookingId = e.BookingId AND t.CurrencyId = e.CurrencyId;
use joins
select t.BookingId,t.CurrencyId,e.TotalExtrasPrice,t.TotalTransfersPrice
from transfers as t
join extras as e
on t.BookingId = e.BookingId and t.CurrencyId = e.CurrencyId
If you want to cover the case where a combination of BookingID and CurrencyID only exist in either Transfers or Extras and you still want to include them in the result (rather than finding the intersect) this query will do that:
SELECT IDs.BookingId, IDs.CurrencyID, e.TotalExtrasPrice,t.TotalTransfersPrice
FROM (
SELECT BookingId,CurrencyId FROM transfers
UNION
SELECT BookingId,CurrencyId FROM extras
) IDs
LEFT JOIN transfers t ON IDs.BookingId=t.BookingId AND IDs.CurrencyID=t.CurrencyID
LEFT JOIN extras e ON IDs.BookingId=e.BookingId AND IDs.CurrencyID=e.CurrencyID
This query will produce a result identical to your example.
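That claim can be checked directly. Here is the UNION-plus-LEFT-JOIN query run against the question's sample data with Python's sqlite3 (chosen here precisely because this variant works even on engines without FULL OUTER JOIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transfers (BookingID INT, CurrencyID INT, TotalTransfersPrice REAL);
CREATE TABLE extras    (BookingID INT, CurrencyID INT, TotalExtrasPrice REAL);
INSERT INTO transfers VALUES (1,1,340.00),(2,1,120.00),(2,2,100.00);
INSERT INTO extras    VALUES (1,1,200.00),(1,2,440.00),(2,1,310.00);
""")

# The UNION derived table collects every {BookingID, CurrencyID} pair
# present in either table; the two LEFT JOINs then attach whichever
# price exists, leaving NULL (None) for the missing side.
rows = conn.execute("""
SELECT IDs.BookingID, IDs.CurrencyID, e.TotalExtrasPrice, t.TotalTransfersPrice
FROM (SELECT BookingID, CurrencyID FROM transfers
      UNION
      SELECT BookingID, CurrencyID FROM extras) IDs
LEFT JOIN transfers t ON IDs.BookingID = t.BookingID AND IDs.CurrencyID = t.CurrencyID
LEFT JOIN extras    e ON IDs.BookingID = e.BookingID AND IDs.CurrencyID = e.CurrencyID
ORDER BY IDs.BookingID, IDs.CurrencyID
""").fetchall()
for r in rows:
    print(r)
```

The output matches the desired master view row for row, including the NULLs where a pair exists in only one of the tables.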
This works. All you need is a simple full outer join.
SELECT "BookingID", "CurrencyID",
ext."TotalExtrasPrice", trans."TotalTransfersPrice"
FROM Transfers trans FULL OUTER JOIN Extras ext
USING ("BookingID", "CurrencyID");
SQLFiddle demo using Oracle.