I have a dataset in which I need to find gaps in the temporal columns and show what happened before and after each interval, for each person in the list.
If nothing was present on the day before valid_from, I want a 0; if something was present, I want whatever was present.
If nothing was present on the day after valid_to, I want a 0.
See the example below, where person XXX starts in G1 and moves to G2 with only one day in between. Hence old_value = 0 and new_value = G1, followed by old_value = G1 and new_value = G2.
It should never happen that valid_to overlaps the next valid_from for the same person_id.
Input data:
id  person_id  current_value  valid_from  valid_to
1   XXX        G1             2022-01-01  2022-02-01
2   XXX        G2             2022-02-02  2022-03-01
3   YYY        G1             2022-01-01  2022-02-01
4   YYY        G3             2022-02-02  2022-03-01
5   YYY        G1             2022-04-01  2022-04-30
6   ZZZ        G2             2022-01-01  2022-01-31
Expected result:
person_id  old_value  new_value  valid_from
XXX        0          G1         2022-01-01
XXX        G1         G2         2022-02-02
XXX        G2         0          2022-03-02
YYY        0          G1         2022-01-01
YYY        G1         G3         2022-02-02
YYY        G3         0          2022-02-03
YYY        0          G1         2022-04-01
YYY        G1         0          2022-05-01
ZZZ        0          G2         2022-01-01
ZZZ        G2         0          2022-02-01
What has been tried so far: self joins and various filters, but those do not catch the case with ZZZ.
I also tried some magic with LAG in an inner query, but with no success.
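A minimal sketch of one way to approach this (not a verified solution), assuming BigQuery, DATE-typed valid_from/valid_to columns, and a hypothetical source table named history with the columns shown above:
with ordered as (
  select
    person_id,
    current_value,
    valid_from,
    valid_to,
    lag(current_value) over (partition by person_id order by valid_from) as prev_value,
    lag(valid_to)      over (partition by person_id order by valid_from) as prev_valid_to,
    lead(valid_from)   over (partition by person_id order by valid_from) as next_valid_from
  from history
)
-- one row per interval start: carry the previous value only when that
-- interval ended exactly one day before this one starts, otherwise 0
select
  person_id,
  case
    when prev_valid_to = date_sub(valid_from, interval 1 day) then prev_value
    else '0'
  end as old_value,
  current_value as new_value,
  valid_from
from ordered
union all
-- one closing row per gap (or at the end of the history): new_value becomes 0
select
  person_id,
  current_value as old_value,
  '0' as new_value,
  date_add(valid_to, interval 1 day) as valid_from
from ordered
where next_valid_from is null
   or next_valid_from <> date_add(valid_to, interval 1 day)
order by person_id, valid_from
The first branch produces the old_value/new_value pair at every valid_from; the second adds a closing row only where the next interval for the same person does not start the very next day.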
Related
Given the sample data, what would be the most efficient way to do the following:
ID event_type event_id event_date quantity
123 B 0 2022-12-31 3
123 A 1 2023-01-01 2
123 A 2 2023-01-02 3
123 C 3 2023-01-03 8
123 A 4 2023-01-04 3
123 A 5 2023-01-05 1
123 B 6 2023-01-06 1
123 C 8 2023-01-07 5
Sum the quantity for all the Bs that happen before some A, and all the Cs that happen after some A?
Meaning, in this case, for B we will sum up only the first line, and for C we will sum up both rows.
ID event_type quantity
123 B 3
123 C 15
Sum the quantity for all the Bs that happen before all the As and all the Cs that happen exclusively after all As?
ID event_type quantity
123 B 3
123 C 8
For the second case, I can look at the min and max date of event A and compare them against the dates of C and B. I am not sure about the first case, where events can be followed by one another.
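One way to express the first case with the same min/max idea, using window functions instead of a separate aggregation (a sketch only: BigQuery syntax and a hypothetical table name events are assumptions). "B before some A" is equivalent to "B before the latest A", and "C after some A" to "C after the earliest A":
with flagged as (
  select
    *,
    -- earliest and latest A per ID, repeated onto every row
    min(if(event_type = 'A', event_date, null)) over (partition by id) as first_a,
    max(if(event_type = 'A', event_date, null)) over (partition by id) as last_a
  from events
)
select id, event_type, sum(quantity) as quantity
from flagged
where (event_type = 'B' and event_date < last_a)   -- B before *some* A
   or (event_type = 'C' and event_date > first_a)  -- C after *some* A
group by id, event_type
For the second case, the same query works with the comparisons tightened to event_date < first_a and event_date > last_a.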
TableA
ID  Counter  Value
1   1        10
1   2        28
1   3        34
1   4        22
1   5        80
2   1        15
2   2        50
2   3        39
2   4        33
2   5        99
TableB
StartDate   EndDate
2020-01-01  2020-01-11
2020-01-02  2020-01-12
2020-01-03  2020-01-13
2020-01-04  2020-01-14
2020-01-05  2020-01-15
2020-01-06  2020-01-16
TableC (output)
ID  Counter  StartDate   EndDate     Val
1   1        2020-01-01  2020-01-11  10
2   1        2020-01-01  2020-01-11  15
1   2        2020-01-02  2020-01-12  28
2   2        2020-01-02  2020-01-12  50
1   3        2020-01-03  2020-01-13  34
2   3        2020-01-03  2020-01-13  39
1   4        2020-01-04  2020-01-14  22
2   4        2020-01-04  2020-01-14  33
1   5        2020-01-05  2020-01-15  80
2   5        2020-01-05  2020-01-15  99
1   1        2020-01-06  2020-01-16  10
2   1        2020-01-06  2020-01-16  15
I am attempting to come up with some SQL to create TableC. TableC takes the rows of TableB in chronological order and, for each ID in TableA, finds the next Counter in the sequence and assigns it to that StartDate/EndDate combination; when it reaches the end of the Counter sequence, it starts back at 1.
Is something like this even possible with SQL?
Yes, this is possible. Try the following:
Calculate the maximal value of Counter in TableA using SELECT MAX(Counter) ... into max_counter.
Add an identifier row_number to each row in TableB, so the matching Counter value can be found, using SELECT ROW_NUMBER() OVER(ORDER BY StartDate) ....
Establish the relation between the row number in TableB and Counter in TableA like this: ... FROM TableB JOIN TableA ON COALESCE(NULLIF(TableB.row_number % max_counter, 0), max_counter) = TableA.Counter.
Then gather all these queries into one query using a CTE (Common Table Expression), as the official documentation shows.
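A sketch of those steps combined into a single query, using the table and column names from the question (the exact syntax may vary by dialect; MOD() and ROW_NUMBER() as written here are standard SQL / BigQuery):
with max_c as (
  -- step 1: the highest Counter value in TableA
  select max(Counter) as max_counter
  from TableA
),
numbered as (
  -- step 2: number the TableB rows in chronological order
  select
    StartDate,
    EndDate,
    row_number() over (order by StartDate) as rn
  from TableB
)
-- step 3: wrap the row number back around to 1 after max_counter
select
  a.ID,
  a.Counter,
  n.StartDate,
  n.EndDate,
  a.Value as Val
from numbered n
cross join max_c m
join TableA a
  on a.Counter = coalesce(nullif(mod(n.rn, m.max_counter), 0), m.max_counter)
order by n.StartDate, a.ID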
Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter  -- 5 = number of Counter values per ID in tableA
from tableB
)
using (counter)
if applied to sample data in your question - output is
I have two tables that I am trying to join. The tables have a primary and foreign key, but there are some instances where the keys don't match, and I need to join on the next best match.
I tried to use a CASE statement, and it works, but because the join isn't perfect it will either grab the incorrect value or duplicate the record.
The way the table works is: if the Info_IDs don't match up, we can fall back to a combination of Lev1 and whether the Cust_Start date is between Info_Start and Info_End.
I need a way to match on the IDs and then have the SQL stop matching on that row, but I'm not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match but the dates don't, so it uses the ELSE branch to join. This is incorrect.
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match, but the ELSE branch also matches the wrong duplicate row. This is also incorrect.
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]
Use two left joins, one for each of the conditions:
select c.*,
       coalesce(ii.info_start, il.info_start),
       coalesce(ii.info_end, il.info_end),
       coalesce(ii.state, il.state)
from customer c
left join info ii
  on c.cust_infoid = ii.info_id
left join info il
  -- fallback match, attempted only when the ID join above found no row
  on ii.info_id is null and
     c.cust_lev1 = il.info_lev1 and
     c.cust_start between il.info_start and il.info_end
Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
    -- prefer the exact ID match over the Lev1/date fallback
    case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
when applied to sample data in your question - output is
I have this dataset:
product customer date value buyer_position
A 123455 2020-01-01 00:01:01 100 1
A 123456 2020-01-02 00:02:01 100 2
A 523455 2020-01-02 00:02:05 100 NULL
A 323455 2020-01-03 00:02:07 100 NULL
A 423455 2020-01-03 00:09:01 100 3
B 100455 2020-01-01 00:03:01 100 1
B 999445 2020-01-01 00:04:01 100 NULL
B 122225 2020-01-01 00:04:05 100 2
B 993848 2020-01-01 10:04:05 100 3
B 133225 2020-01-01 11:04:05 100 NULL
B 144225 2020-01-01 12:04:05 100 4
The dataset has the products the company sells and the customers who saw each product. A customer can see more than one product, but the combination of product + customer is never repeated. I want to get how many people had bought the product before the customer saw it.
This would be the perfect output:
product customer date value buyer_position people_before
A 123455 2020-01-01 00:01:01 100 1 0
A 123456 2020-01-02 00:02:01 100 2 1
A 523455 2020-01-02 00:02:05 100 NULL 2
A 323455 2020-01-03 00:02:07 100 NULL 2
A 423455 2020-01-03 00:09:01 100 3 2
B 100455 2020-01-01 00:03:01 100 1 0
B 999445 2020-01-01 00:04:01 100 NULL 1
B 122225 2020-01-01 00:04:05 100 2 1
B 993848 2020-01-01 10:04:05 100 3 2
B 133225 2020-01-01 11:04:05 100 NULL 3
B 144225 2020-01-01 12:04:05 100 4 3
As you can see, by the time customer 122225 saw the product he wanted, one person had already bought it. In the case of customer 323455, two people had already bought product A.
I think I should use some window function, like lag(). But lag() won't give me this "cumulative" information, so I'm kind of lost here.
This looks like a window count of non-null values of buyer_position over the preceding rows:
select t.*,
coalesce(count(buyer_position) over(
partition by product
order by date
rows between unbounded preceding and 1 preceding
), 0) as people_before
from mytable t
Hmmm . . . If I understand correctly, You want the max of the buyer position for the customer/product minus 1:
select t.*,
max(buyer_position) over (partition by customer, product order by date rows between unbounded preceding and current row) - 1
from t;
I am using Oracle SQL.
Here is the example table:
MachineStatus
--------------
Machine Status date_time station
G1 1 07/09/2014 10:11 s1
G2 1 07/09/2014 10:11 s1
G3 0 07/09/2014 10:11 s1
G1 1 07/09/2014 10:12 s1
G2 1 07/09/2014 10:12 s1
G3 0 07/09/2014 10:12 s1
G1 0 07/09/2014 10:13 s1
G2 0 07/09/2014 10:13 s1
G3 0 07/09/2014 10:13 s1
I want to list the status of the station for any given minute as available (status 1) if any of the machines is available, as below.
Station status date_time
s1 1 07/09/2014 10:11
s1 1 07/09/2014 10:12
s1 0 07/09/2014 10:13
A report needs to be generated on a daily/weekly basis based on the availability.
How should I approach this query?
The following query will give you the result from your example:
SELECT
station,
(CASE WHEN SUM(status) > 0 THEN 1 ELSE 0 END) AS status,
date_time
FROM MachineStatus
GROUP BY station, date_time
ORDER BY date_time
Having at least one status equal to 1 means that the sum of statuses for that station and date_time must be greater than 0.
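Since status only takes the values 0 and 1, an equivalent variant is to take the maximum per group; a sketch against the same table:
SELECT
  station,
  MAX(status) AS status,
  date_time
FROM MachineStatus
GROUP BY station, date_time
ORDER BY date_time
MAX(status) is 1 exactly when at least one machine in the group is available.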