How to execute join between three slow change dimensions sort by all start date columns? - sql

I'm trying to join data between three slow change dimension type 2. When I query the result, the sort by date between the dimensions are not as expected.
I have the slow change dimensions below:
Table Subsidiaries
id
name
subsidiary
department
start_date_dep
end_date_dep
last_record_flg
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
1
John Doe
AL
Sales
2013-01-01
2014-05-01
0
1
John Doe
NY
Sales
2014-05-01
1
38
Ivy Johnson
NY
Sales
2020-06-01
1
Table Functions
id
function
start_date_fun
end_date_fun
last_record_flg
1
operator
2005-10-01
2009-08-01
0
1
leader
2009-08-01
2011-10-01
0
1
manager
2011-10-01
2017-07-01
0
1
director
2017-07-01
1
38
operator
2020-06-01
1
Table Graduations
id
university_graduation
conclusion_date
last_record_flg
1
bachelor
15/12/2005
0
1
master
15/12/2008
1
38
bachelor
15/12/2014
1
The desired result is:
id
name
subsidiary
department
start_date_dep
end_date_dep
last_record_flg
function
start_date_fun
end_date_fun
last_record_flg
university_graduation
conclusion_date
last_record_flg
max_date
seq
start
end
last_record_flg
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
operator
2005-10-01
2009-08-01
0
bachelor
2005-12-15
0
2005-12-15
1
2005-10-01
2008-12-15
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
operator
2005-10-01
2009-08-01
0
master
2008-12-15
1
2008-12-15
1
2008-12-15
2009-08-01
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
leader
2009-08-01
2011-10-01
0
master
2008-12-15
1
2009-08-01
1
2009-08-01
2011-10-01
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2011-10-01
1
2011-10-01
2013-01-01
0
1
John Doe
AL
Sales
2013-01-01
2014-05-01
0
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2013-01-01
1
2013-01-01
2014-05-01
0
1
John Doe
NY
Sales
2014-05-01
NULL
1
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2014-05-01
1
2014-05-01
2017-07-01
0
1
John Doe
NY
Sales
2014-05-01
NULL
1
director
2017-07-01
NULL
1
master
2008-12-15
1
2017-07-01
1
2017-07-01
NULL
1
38
Ivy Johnson
NY
Sales
2020-06-01
NULL
1
operator
2020-06-01
NULL
1
bachelor
2014-12-15
1
2020-06-01
1
2020-06-01
NULL
1
I tried with CROSS APPLY, but is returning only one line for each id. I'm trying with CASE WHEN but the query output is not exactly equal the desired result. In my return the column 'FUNCTION' and 'START_DATE_FUN' not follow the sequence (sort) presented in the desired result, the same occur for columns 'UNIVERSITY_GRADUATION' and 'CONCLUSION_DATE'.
The query:
select
*
from(
select
tb.*
,row_number() over(partition by tb.id,tb.max_date order by tb.max_date) as seq
,tb.max_date as [start]
,lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) as [end]
,case when lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) is null then 1 else 0 end as last_record_flg
from(
select
sb.id
,sb.[name]
,sb.subsidiary
,sb.department
,sb.start_date_dep
,sb.end_date_dep
,sb.last_record_flg as lr_sb
,fc.[function]
,fc.start_date_fun
,fc.end_date_fun
,fc.last_record_flg as lr_fc
,gd.university_graduation
,gd.end_date_grad
,gd.last_record_flg as lr_gd
,case
when sb.start_date_dep >= fc.start_date_fun and sb.start_date_dep >= gd.end_date_grad then sb.start_date_dep
when fc.start_date_fun >= sb.start_date_dep and fc.start_date_fun >= gd.end_date_grad then fc.start_date_fun
else gd.end_date_grad
end as max_date
from
#Subsidiaries as sb
left outer join #Functions as fc
on sb.id = fc.id
left outer join #Graduations as gd
on sb.id = gd.id
) as tb
) as tb2
where
tb2.seq = 1
Below the DDL:
create table #Subsidiaries (
id int
,[name] varchar(15)
,subsidiary varchar(2)
,department varchar(15)
,start_date_dep date
,end_date_dep date
,last_record_flg bit
)
go
insert into #Subsidiaries values
(1,'John Doe','AL','Engineering','2005-10-01','2013-01-01',0),
(1,'John Doe','AL','Sales','2013-01-01','2014-05-01',0),
(1,'John Doe','NY','Sales','2014-05-01',null,1),
(38,'Ivy Johnson','NY','Sales','2020-06-01',null,1)
go
create table #Functions (
id int
,[function] varchar(15)
,start_date_fun date
,end_date_fun date
,last_record_flg bit
)
go
insert into #Functions values
(1,'operator','2005-10-01','2009-08-01',0),
(1,'leader','2009-08-01','2011-10-01',0),
(1,'manager','2011-10-01','2017-07-01',0),
(1,'director','2017-07-01',null,1),
(38,'operator','2020-06-01',null,1)
go
create table #Graduations (
id int
,university_graduation varchar(15)
,end_date_grad date
,last_record_flg bit
)
go
insert into #Graduations values
(1,'bachelor','2005-12-15',0),
(1,'master','2008-12-15',1),
(38,'bachelor','2014-12-15',1)
go

Case when someone find the same difficult to join two or more SCD type 2, I could find a reference in this link https://sqlsunday.com/2014/11/30/joining-two-scd2-tables/ (SQL Sunday) that help me to build the query and use the range intervals in the join condition to return result as desired.

Related

SQL - Period range in subgroups of a group by

I have the following dataset:
A
B
C
1
John
2018-08-14
1
John
2018-08-20
1
John
2018-09-03
2
John
2018-11-13
2
John
2018-12-11
2
John
2018-12-12
1
John
2020-01-20
1
John
2020-01-21
3
John
2021-03-02
3
John
2021-03-03
1
John
2020-05-10
1
John
2020-05-12
And I would like to have the following result:
A
B
C
1
John
2018-08-14
2
John
2018-11-13
1
John
2020-01-20
3
John
2021-03-02
1
John
2020-05-10
If I group by A, B the 1st row and the third just concatenate which is coherent. How could I create another columns to still use a group by and have the result I want.
If you have another ideas than mine, please explain it !
I tried to use some first, last, rank, dense_rank without success.
Use lag(). Looks like B is a function of A in your data. So checking lag(A) will suffice.
select A,B,C
from (
select *, case when lag(A) over(order by C) = A then 0 else 1 end startFlag
from mytable
) t
where startFlag = 1
order by C

JOIN on ID, IF ID doesn't match then match on other columns BigQuery

I have two tables that I am trying to join. The tables have a primary and foreign key but there are some instances where the keys don't match and I need to join on the next best match.
I tried to use a case statement and it works but because the join isn't perfect. It will either grab the incorrect value or duplicate the record.
The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
I need a way to match on the IDs and then the SQL stops matching on that row. But im not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1112 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the dates don't so it uses the ELSE statement to join. This is incorrect
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the ELSE statement also matches up the wrong duplicate row. This is also incorrect
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]
The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
Use two left joins, one for each of the conditions:
select c.*,
coalesce(ii.info_start, il.info_start),
coalesce(ii.info_end, il.info_end),
coalesce(ii.state, il.state)
from customer c left join
info ii
on c.cust_infoid = ii.info_id left join
info il
on ii.info_id is null and
c.cust_lev1 = il.info_lev1 and
c.cust_start between il.info_start and il.info_end
Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
when applied to sample data in your question - output is

Exclude rows where keys match, but are on different rows

I'm looking for the best way to produce the result set in the scenario provided. My cust3 column isn't identifying the repeated values in the indvid2 column. The end result I'm looking for is to exclude the rows where key1 and key2 match (ids:1,2,6 and 7), then sum accounts where the acctids match.If there's a better way to code this, I welcome all suggestions. Thanks!
WITH T10 as (
SELECT acctid,invid,(
case
when invid like '%-R' then left (InvID,LEN(invid) -2) else InvID
END) as InvID2
FROM table x
GROUP BY acctID,invID
),
T11 as (
SELECT acctid, Invid2, COUNT(InvID2) as cust3
FROM T10
GROUP BY InvID2,acctid
HAVING
COUNT (InvID2) > 1
)
select DISTINCT
a.acctid,
a.name,
b.invid,
C.invid2,
D.cust3,
b.amt,
b.key1,
b.key2
from table a
inner join table b (nolock) on a.acctid = b.acctid
inner join T10 C (nolock) on b.invid = c.invid
inner join T11 D (nolock) on C.invid2 = D.invid2
Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 2 $350 NULL NULL
4 123 James 103 103 2 $200 NULL NULL
5 246 Tony 98-R 98 2 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 2 ($50) 6157 NULL
9 753 George 95 95 2 $365 NULL NULL
10 753 George 108 108 2 $100 NULL NULL
Desired Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 1 $350 NULL NULL
4 123 James 103 103 1 $200 NULL NULL
5 246 Tony 98-R 98 1 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 1 ($50) 6157 NULL
9 753 George 95 95 1 $365 NULL NULL
10 753 George 108 108 1 $100 NULL NULL
Then to sum amt by acctid
id acctid name amt
1 123 James $550
2 246 Tony ($750)
3 359 Stan ($50)
4 753 George $465
Something like:
;WITH Keys as (
SELECT Key1.acctID, [Key] = Key1.Key1
FROM YourTable as Key1
INNER JOIN YourTable as Key2
ON Key1.Key1 = Key2.Key2 and Key1.acctID = Key2.acctID
)
SELECT t.acctID, t.name, amt = SUM(t.amt)
FROM YourTable as t
LEFT JOIN Keys as k
ON t.acctID = k.acctID and (t.Key1 = [Key] or t.Key2 = [Key])
WHERE k.acctID is Null
GROUP BY t.acctID, t.name

SQL Query to extract Average Quantity

I want to calculate the average quantity customer purchases in one basket
(Where 1 Basket = Multitple purchases in 1 day)
from the table transaction and update it into cust360.average_basket_Qty
Transactions:-
Trasactions_ID cust_id Tran_date Qty Total_amt
80712190438 270351 2014-02-28 00:00:00.000 5 4265.3
29258453508 270384 2014-02-27 00:00:00.000 5 8270.925
51750724947 273420 2014-02-24 00:00:00.000 2 1748.11
93274880719 271509 2014-02-24 00:00:00.000 3 4518.345
51750724947 273420 2014-02-23 00:00:00.000 2 1748.11
97439039119 272357 2014-02-23 00:00:00.000 2 1821.04
45649838090 273667 2014-02-22 00:00:00.000 1 1602.25
Cust360 Table: -
Cust_id Gender Age Basket_count Total_sale Date Average_basket_Qty
266783 M 525 0 3113 2013-02-20 NULL
266784 F 314 1 5694 2012-12-04 NULL
266785 F 392 0 21613 2013-08-01 NULL
266788 F 551 0 6092 2013-02-12 NULL
266794 F 564 1 27981 2014-02-12 NULL
Following query should work for you
UPDATE DEST
SET DEST.Average_basket_Qty = SRC.AVG_QTY
FROM Cust360 DEST
INNER JOIN
(
SELECT CUST_ID, CAST(TRAN_DATE AS DATE), AVG(QTY) AVG_QTY
FROM Transactions
GROUP BY CUST_ID,CAST(TRAN_DATE AS DATE)
) SRC ON SRC.CUST_ID=DEST.Cust_id

Complete datediff calculation based on MAX and MIN values

Sorry I did post a question similar earlier, but I was not that clear. I have a table with the fields, Customer, ID_Date, Pstng_Date, SUMOfAmount, Days_BetweenMax and days_between Min.
What I want is a query that shows me the date difference between the pstng_date and the ID_Date where the pstng_date is the max value for that customer and another column that shows the same calculation where the pstng_date is the minimum value for that customer. Those customers with only one Pstng_date should display as zero
So the Query should display the results like this:
Customer ID_Date Pstng_Date SumOfAmount Days_BetweenMAX days_betweenMIN
-------- ---------- ---------- ----------- ------------
Holmes 31/01/2014 10/01/2014 $21,545.59 0 0
James 31/01/2014 10/01/2014 -$21,197.89 0 21
James 31/01/2014 5/01/2014 -$7,823.14 0 0
James 31/01/2014 24/01/2014 $308.00 7 0
Rod 31/01/2014 17/01/2014 -$2,603.95 0 0
Lisa 31/01/2014 17/01/2014 $22,019.49 0 0
Assuming that your existing table is called [Postings], you could create a query to calculate the MIN() and MAX() values of [Pstng_Date]
SELECT
Customer,
MIN(Pstng_Date) AS MinOfPstng_Date,
MAX(Pstng_Date) AS MaxOfPstng_Date
FROM Postings
GROUP BY Customer
returning
Customer MinOfPstng_Date MaxOfPstng_Date
-------- --------------- ---------------
Holmes 2014-01-10 2014-01-10
James 2014-01-05 2014-01-24
Lisa 2014-01-17 2014-01-17
Rod 2014-01-17 2014-01-17
Then you could use that as a subquery in the query to calculate the date differences
SELECT
p.Customer,
p.ID_Date,
p.Pstng_Date,
p.SumOfAmount,
IIf(q.MaxOfPstng_Date=q.MinOfPstng_Date,0,IIf(p.Pstng_Date=q.MaxOfPstng_Date,DateDiff("d",p.Pstng_Date,p.ID_Date),0)) AS Days_BetweenMAX,
IIf(q.MaxOfPstng_Date=q.MinOfPstng_Date,0,IIf(p.Pstng_Date=q.MinOfPstng_Date,DateDiff("d",p.Pstng_Date,p.ID_Date),0)) AS Days_BetweenMIN
FROM
Postings AS p
INNER JOIN
(
SELECT
Customer,
MIN(Pstng_Date) AS MinOfPstng_Date,
MAX(Pstng_Date) AS MaxOfPstng_Date
FROM Postings
GROUP BY Customer
) AS q
ON p.Customer = q.Customer
returning
Customer ID_Date Pstng_Date SumOfAmount Days_BetweenMAX Days_BetweenMIN
-------- ---------- ---------- ----------- --------------- ---------------
Holmes 2014-01-31 2014-01-10 21545.59 0 0
James 2014-01-31 2014-01-10 -21197.89 0 0
James 2014-01-31 2014-01-05 -7823.14 0 26
James 2014-01-31 2014-01-24 308.00 7 0
Rod 2014-01-31 2014-01-17 -2603.95 0 0
Lisa 2014-01-31 2014-01-17 22019.49 0 0