JOIN on ID, IF ID doesn't match then match on other columns BigQuery

JOIN on ID, IF ID doesn't match then match on other columns BigQuery - sql

I have two tables that I am trying to join. The tables have a primary and foreign key but there are some instances where the keys don't match and I need to join on the next best match.
I tried to use a case statement and it works but because the join isn't perfect. It will either grab the incorrect value or duplicate the record.
The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
I need a way to match on the IDs and then the SQL stops matching on that row. But im not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1112 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the dates don't so it uses the ELSE statement to join. This is incorrect
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the ELSE statement also matches up the wrong duplicate row. This is also incorrect
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]

The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
Use two left joins, one for each of the conditions:
select c.*,
coalesce(ii.info_start, il.info_start),
coalesce(ii.info_end, il.info_end),
coalesce(ii.state, il.state)
from customer c left join
info ii
on c.cust_infoid = ii.info_id left join
info il
on ii.info_id is null and
c.cust_lev1 = il.info_lev1 and
c.cust_start between il.info_start and il.info_end

Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
when applied to sample data in your question - output is

Related

How to execute join between three slow change dimensions sort by all start date columns?

I'm trying to join data between three slow change dimension type 2. When I query the result, the sort by date between the dimensions are not as expected.
I have the slow change dimensions below:
Table Subsidiaries
id
name
subsidiary
department
start_date_dep
end_date_dep
last_record_flg
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
1
John Doe
AL
Sales
2013-01-01
2014-05-01
0
1
John Doe
NY
Sales
2014-05-01
1
38
Ivy Johnson
NY
Sales
2020-06-01
1
Table Functions
id
function
start_date_fun
end_date_fun
last_record_flg
1
operator
2005-10-01
2009-08-01
0
1
leader
2009-08-01
2011-10-01
0
1
manager
2011-10-01
2017-07-01
0
1
director
2017-07-01
1
38
operator
2020-06-01
1
Table Graduations
id
university_graduation
conclusion_date
last_record_flg
1
bachelor
15/12/2005
0
1
master
15/12/2008
1
38
bachelor
15/12/2014
1
The desired result is:
id
name
subsidiary
department
start_date_dep
end_date_dep
last_record_flg
function
start_date_fun
end_date_fun
last_record_flg
university_graduation
conclusion_date
last_record_flg
max_date
seq
start
end
last_record_flg
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
operator
2005-10-01
2009-08-01
0
bachelor
2005-12-15
0
2005-12-15
1
2005-10-01
2008-12-15
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
operator
2005-10-01
2009-08-01
0
master
2008-12-15
1
2008-12-15
1
2008-12-15
2009-08-01
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
leader
2009-08-01
2011-10-01
0
master
2008-12-15
1
2009-08-01
1
2009-08-01
2011-10-01
0
1
John Doe
AL
Engineering
2005-10-01
2013-01-01
0
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2011-10-01
1
2011-10-01
2013-01-01
0
1
John Doe
AL
Sales
2013-01-01
2014-05-01
0
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2013-01-01
1
2013-01-01
2014-05-01
0
1
John Doe
NY
Sales
2014-05-01
NULL
1
manager
2011-10-01
2017-07-01
0
master
2008-12-15
1
2014-05-01
1
2014-05-01
2017-07-01
0
1
John Doe
NY
Sales
2014-05-01
NULL
1
director
2017-07-01
NULL
1
master
2008-12-15
1
2017-07-01
1
2017-07-01
NULL
1
38
Ivy Johnson
NY
Sales
2020-06-01
NULL
1
operator
2020-06-01
NULL
1
bachelor
2014-12-15
1
2020-06-01
1
2020-06-01
NULL
1
I tried with CROSS APPLY, but is returning only one line for each id. I'm trying with CASE WHEN but the query output is not exactly equal the desired result. In my return the column 'FUNCTION' and 'START_DATE_FUN' not follow the sequence (sort) presented in the desired result, the same occur for columns 'UNIVERSITY_GRADUATION' and 'CONCLUSION_DATE'.
The query:
select
*
from(
select
tb.*
,row_number() over(partition by tb.id,tb.max_date order by tb.max_date) as seq
,tb.max_date as [start]
,lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) as [end]
,case when lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) is null then 1 else 0 end as last_record_flg
from(
select
sb.id
,sb.[name]
,sb.subsidiary
,sb.department
,sb.start_date_dep
,sb.end_date_dep
,sb.last_record_flg as lr_sb
,fc.[function]
,fc.start_date_fun
,fc.end_date_fun
,fc.last_record_flg as lr_fc
,gd.university_graduation
,gd.end_date_grad
,gd.last_record_flg as lr_gd
,case
when sb.start_date_dep >= fc.start_date_fun and sb.start_date_dep >= gd.end_date_grad then sb.start_date_dep
when fc.start_date_fun >= sb.start_date_dep and fc.start_date_fun >= gd.end_date_grad then fc.start_date_fun
else gd.end_date_grad
end as max_date
from
#Subsidiaries as sb
left outer join #Functions as fc
on sb.id = fc.id
left outer join #Graduations as gd
on sb.id = gd.id
) as tb
) as tb2
where
tb2.seq = 1
Below the DDL:
create table #Subsidiaries (
id int
,[name] varchar(15)
,subsidiary varchar(2)
,department varchar(15)
,start_date_dep date
,end_date_dep date
,last_record_flg bit
)
go
insert into #Subsidiaries values
(1,'John Doe','AL','Engineering','2005-10-01','2013-01-01',0),
(1,'John Doe','AL','Sales','2013-01-01','2014-05-01',0),
(1,'John Doe','NY','Sales','2014-05-01',null,1),
(38,'Ivy Johnson','NY','Sales','2020-06-01',null,1)
go
create table #Functions (
id int
,[function] varchar(15)
,start_date_fun date
,end_date_fun date
,last_record_flg bit
)
go
insert into #Functions values
(1,'operator','2005-10-01','2009-08-01',0),
(1,'leader','2009-08-01','2011-10-01',0),
(1,'manager','2011-10-01','2017-07-01',0),
(1,'director','2017-07-01',null,1),
(38,'operator','2020-06-01',null,1)
go
create table #Graduations (
id int
,university_graduation varchar(15)
,end_date_grad date
,last_record_flg bit
)
go
insert into #Graduations values
(1,'bachelor','2005-12-15',0),
(1,'master','2008-12-15',1),
(38,'bachelor','2014-12-15',1)
go

Case when someone find the same difficult to join two or more SCD type 2, I could find a reference in this link https://sqlsunday.com/2014/11/30/joining-two-scd2-tables/ (SQL Sunday) that help me to build the query and use the range intervals in the join condition to return result as desired.

Need help joining incremental data to a fact table in an incremental manor

TableA
ID
Counter
Value
1
1
10
1
2
28
1
3
34
1
4
22
1
5
80
2
1
15
2
2
50
2
3
39
2
4
33
2
5
99
TableB
StartDate
EndDate
2020-01-01
2020-01-11
2020-01-02
2020-01-12
2020-01-03
2020-01-13
2020-01-04
2020-01-14
2020-01-05
2020-01-15
2020-01-06
2020-01-16
TableC (output)
ID
Counter
StartDate
EndDate
Val
1
1
2020-01-01
2020-01-11
10
2
1
2020-01-01
2020-01-11
15
1
2
2020-01-02
2020-01-12
28
2
2
2020-01-02
2020-01-12
50
1
3
2020-01-03
2020-01-13
34
2
3
2020-01-03
2020-01-13
39
1
4
2020-01-04
2020-01-14
22
2
4
2020-01-04
2020-01-14
33
1
5
2020-01-05
2020-01-15
80
2
5
2020-01-05
2020-01-15
99
1
1
2020-01-06
2020-01-16
10
2
1
2020-01-06
2020-01-16
15
I am attempting to come up with some SQL to create TableC. What TableC is, it takes the data from TableB, in chronological order, and for each ID in tableA, it finds the next counter in the sequence, and assigns that to the Start/End date combination for that ID, and when it reaches the end of the counter, it will start back at 1.
Is something like this even possible with SQL?

Yes this is possible. Try to do the following:
Calculate maximal value for Counter in TableA using SELECT MAX(Counter) ... into max_counter.
Add identifier row_number to each row in TableB so it will be able to find matching Counter value using SELECT ROW_NUMBER() OVER() ....
Establish relation between row number in TableB and Counter in TableA like this ... FROM TableB JOIN TableA ON (COALESCE(NULLIF(TableB.row_number % max_counter = 0), max_counter)) = TableA.Counter.
Then gather all these queries using CTE (Common Table Expression) into one query as official documentation shows.

Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter
from tableB
)
using (counter)
if applied to sample data in your question - output is

Query to find active days per year to find revenue per user per year

I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id
user_name
user_joining_date
1
Steve
2013-01-04
2
Adam
2012-11-01
3
John
2013-05-05
4
Tony
2012-01-01
5
Dan
2010-01-01
6
Alex
2019-01-01
7
Kim
2019-01-01
bundle_dim
bundle_id
bundle_name
bundle_type
bundle_cost_per_day
101
movies and TV
prime
5.5
102
TV and sports
prime
6.5
103
Cooking
prime
7
104
Sports and news
prime
5
105
kids movie
extra
2
106
kids educative
extra
3.5
107
spanish news
extra
2.5
108
Spanish TV and sports
extra
3.5
109
Travel
extra
2
plans_fact
user_id
bundle_id
bundle_start_date
bundle_end_date
1
101
2019-10-10
2020-10-10
2
107
2020-01-15
(null)
2
106
2020-01-15
2020-12-31
2
101
2020-01-15
(null)
2
103
2020-01-15
2020-02-15
1
101
2020-10-11
(null)
1
107
2019-10-10
2020-10-10
1
105
2019-10-10
2020-10-10
4
101
2021-01-01
2021-02-01
3
104
2020-02-17
2020-03-17
2
108
2020-01-15
(null)
4
102
2021-01-01
(null)
4
103
2021-01-01
(null)
4
108
2021-01-01
(null)
5
103
2020-01-15
(null)
5
101
2020-01-15
2020-02-15
6
101
2021-01-01
2021-01-17
6
101
2021-01-20
(null)
6
108
2021-01-01
(null)
7
104
2020-02-17
(null)
7
103
2020-01-17
2020-01-18
1
102
2020-12-11
(null)
2
106
2021-01-01
(null)
7
107
2020-01-15
(null)
note: NULL bundle_end_date refers to active subscription.
user active days can be calculated as: bundle_end_date - bundle_start_date (for the given bundle)
total revenue per user could be calculated as : total no. of active days * bundle rate per day
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
, sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.price_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;

You need a 'year' table to help parse out each multi-year spanning row into it's seperate years. For each year, you need to also recalculate the start and end dates. That's what I do in the yearParsed cte in the code below. I hard code the years into the join statement that creates y. You probably will do it different but however you get those values will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null coalesce logic to the cte to make the overall logic simpler.
with yearParsed as (
select pf.*,
y.year,
startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
from plans_fact pf
cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
join (values
(2019, '2019-01-01', '2019-12-31'),
(2020, '2020-01-01', '2020-12-31'),
(2021, '2021-01-01', '2021-12-31')
) y (year, startDt, endDt)
on pf.bundle_start_date <= y.endDt
and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
yp.year,
total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
yp.year
order by yp.user_id,
yp.year;
Now, if this is common, you should probably create a base-table for your 'year' table. But if it's not common, but for this report you don't want to have to keep coming back to hard-code the year information into the y table, you can do this:
declare #yearTable table (
year int,
startDt char(10),
endDt char(10)
);
with y as (
select year = year(min(pf.bundle_start_date))
from #plans_fact pf
union all
select year + 1
from y
where year < year(getdate())
)
insert #yearTable
select year,
startDt = convert(char(4),year) + '-01-01',
endDt = convert(char(4),year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.

How can I create a column which computes only the change of other column on redshift?

I have this dataset:
product customer date value buyer_position
A 123455 2020-01-01 00:01:01 100 1
A 123456 2020-01-02 00:02:01 100 2
A 523455 2020-01-02 00:02:05 100 NULL
A 323455 2020-01-03 00:02:07 100 NULL
A 423455 2020-01-03 00:09:01 100 3
B 100455 2020-01-01 00:03:01 100 1
B 999445 2020-01-01 00:04:01 100 NULL
B 122225 2020-01-01 00:04:05 100 2
B 993848 2020-01-01 10:04:05 100 3
B 133225 2020-01-01 11:04:05 100 NULL
B 144225 2020-01-01 12:04:05 100 4
The dataset has the product the company sells and the customers who saw the product. A customer can see more than one product, but the combination product + customer doesn't have any repetition. I want to get how many people bought the product before the customer sees it.
This would be the perfect output:
product customer date value buyer_position people_before
A 123455 2020-01-01 00:01:01 100 1 0
A 123456 2020-01-02 00:02:01 100 2 1
A 523455 2020-01-02 00:02:05 100 NULL 2
A 323455 2020-01-03 00:02:07 100 NULL 2
A 423455 2020-01-03 00:09:01 100 3 2
B 100455 2020-01-01 00:03:01 100 1 0
B 999445 2020-01-01 00:04:01 100 NULL 1
B 122225 2020-01-01 00:04:05 100 2 1
B 993848 2020-01-01 10:04:05 100 3 2
B 133225 2020-01-01 11:04:05 100 NULL 3
B 144225 2020-01-01 12:04:05 100 4 3
As you can see, when the customer 122225 saw the product he wanted, two people have already bought it. In the case of customer 323455, two people have already bought the product A.
I think I should use some window function, like lag(). But lag() function won't get this "cumulative" information. So I'm kind of lost here.

This looks like a window count of non-null values of buyer_position over the preceding rows:
select t.*,
coalesce(count(buyer_position) over(
partition by product
order by date
rows between unbounded preceding and 1 preceding
), 0) as people_before
from mytable t

Hmmm . . . If I understand correctly, You want the max of the buyer position for the customer/product minus 1:
select t.*,
max(buyer_position) over (partition by customer, product order by date rows between unbounded preceding and current row) - 1
from t;

Exclude rows where keys match, but are on different rows

I'm looking for the best way to produce the result set in the scenario provided. My cust3 column isn't identifying the repeated values in the indvid2 column. The end result I'm looking for is to exclude the rows where key1 and key2 match (ids:1,2,6 and 7), then sum accounts where the acctids match.If there's a better way to code this, I welcome all suggestions. Thanks!
WITH T10 as (
SELECT acctid,invid,(
case
when invid like '%-R' then left (InvID,LEN(invid) -2) else InvID
END) as InvID2
FROM table x
GROUP BY acctID,invID
),
T11 as (
SELECT acctid, Invid2, COUNT(InvID2) as cust3
FROM T10
GROUP BY InvID2,acctid
HAVING
COUNT (InvID2) > 1
)
select DISTINCT
a.acctid,
a.name,
b.invid,
C.invid2,
D.cust3,
b.amt,
b.key1,
b.key2
from table a
inner join table b (nolock) on a.acctid = b.acctid
inner join T10 C (nolock) on b.invid = c.invid
inner join T11 D (nolock) on C.invid2 = D.invid2
Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 2 $350 NULL NULL
4 123 James 103 103 2 $200 NULL NULL
5 246 Tony 98-R 98 2 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 2 ($50) 6157 NULL
9 753 George 95 95 2 $365 NULL NULL
10 753 George 108 108 2 $100 NULL NULL
Desired Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 1 $350 NULL NULL
4 123 James 103 103 1 $200 NULL NULL
5 246 Tony 98-R 98 1 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 1 ($50) 6157 NULL
9 753 George 95 95 1 $365 NULL NULL
10 753 George 108 108 1 $100 NULL NULL
Then to sum amt by acctid
id acctid name amt
1 123 James $550
2 246 Tony ($750)
3 359 Stan ($50)
4 753 George $465

Something like:
;WITH Keys as (
SELECT Key1.acctID, [Key] = Key1.Key1
FROM YourTable as Key1
INNER JOIN YourTable as Key2
ON Key1.Key1 = Key2.Key2 and Key1.acctID = Key2.acctID
)
SELECT t.acctID, t.name, amt = SUM(t.amt)
FROM YourTable as t
LEFT JOIN Keys as k
ON t.acctID = k.acctID and (t.Key1 = [Key] or t.Key2 = [Key])
WHERE k.acctID is Null
GROUP BY t.acctID, t.name

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

JOIN on ID, IF ID doesn't match then match on other columns BigQuery - sql

Related

How to execute join between three slow change dimensions sort by all start date columns?

Need help joining incremental data to a fact table in an incremental manor

Query to find active days per year to find revenue per user per year

How can I create a column which computes only the change of other column on redshift?

Exclude rows where keys match, but are on different rows

Categories

Resources