Why isn't my code to select the latest non-null value for a column and fill in the nulls working? - google-bigquery

I have a table T1 that has daily snapshots of products and their statuses:
SnapshotDate ProductId Status
------------ --------- --------
2022-01-03   1         Sold
2022-01-02   1         Pending
2022-01-01   1         In_Stock
2022-01-03   2         NULL
2022-01-02   2         NULL
2022-01-01   2         Sold
2022-01-03   3         NULL
2022-01-02   3         NULL
2022-01-01   3         Pending
I want to write a query that detects the latest status for each product and fills it in for every row; where the latest status is null, it should select the latest status that is not null. The final output should be:
SnapshotDate ProductId Status
------------ --------- -------
2022-01-03   1         Sold
2022-01-02   1         Sold
2022-01-01   1         Sold
2022-01-03   2         Sold
2022-01-02   2         Sold
2022-01-01   2         Sold
2022-01-03   3         Pending
2022-01-02   3         Pending
2022-01-01   3         Pending
I wrote this code, but it does not work. Does anybody know why, and how I can fix it?
SELECT
    SnapshotDate,
    ProductId,
    COALESCE(Status,
        LAG(Status) OVER (PARTITION BY ProductId ORDER BY SnapshotDate DESC)
    ) AS Status
FROM T1
I've also tried LEAD, and that does not work either.

Try using LAST_VALUE with the IGNORE NULLS modifier instead of LAG. LAG only looks at the single row immediately before the current one, so it still returns NULL when that row's Status is also NULL. And since you want the latest non-null status on every row (not only the null ones), drop the COALESCE and let the window span the whole partition:
SELECT SnapshotDate,
       ProductId,
       LAST_VALUE(Status IGNORE NULLS) OVER (
           PARTITION BY ProductId
           ORDER BY SnapshotDate
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS Status
FROM T1
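The desired fill logic (every row gets its product's latest non-null status) can be sketched in plain Python using the question's sample rows; the tuple representation is just an illustration, not BigQuery code:

```python
# Sample rows from the question as (SnapshotDate, ProductId, Status) tuples.
rows = [
    ("2022-01-03", 1, "Sold"),   ("2022-01-02", 1, "Pending"), ("2022-01-01", 1, "In_Stock"),
    ("2022-01-03", 2, None),     ("2022-01-02", 2, None),      ("2022-01-01", 2, "Sold"),
    ("2022-01-03", 3, None),     ("2022-01-02", 3, None),      ("2022-01-01", 3, "Pending"),
]

# Walk the snapshots in ascending date order; the last non-null assignment
# per product is, by construction, that product's latest non-null status.
latest = {}
for date, pid, status in sorted(rows):
    if status is not None:
        latest[pid] = status

# Fill every row with its product's latest non-null status.
filled = [(date, pid, latest.get(pid)) for date, pid, _ in rows]
```

This mirrors what a LAST_VALUE ... IGNORE NULLS window over the whole partition computes per product.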

Related

Is there a way to get the most recent and the original record with a number of conditions included using SQL

I want to get the most recent data and also the original data for each group in a table, subject to a set of conditions.
Below is the current structure of the dataset/table.
Each group can have multiple items.
Different item_ids can share the same item_name; these are known as changed item_names, with one significant difference: the parentheses. The number inside the parentheses defines how many iterations of changes have been made.
Each item_id can have multiple statuses, but for the example below it is simplified to only two: Draft -> Approved.
group date       item_id    item_name  status   price stock
----- ---------- ---------- ---------- -------- ----- -----
A     2022-01-01 36FG-34-45 AB-1234    Draft    15    100
B     2022-01-02 28AE-23-67 CD-4567    Approved 30    120
A     2022-01-05 45RE-12-99 DE-1234    Approved 20    300
C     2022-01-07 78ED-14-88 EA-4532    Draft    10    500
B     2022-01-05 45AB-16-77 CD-4567(1) Draft    35    200
A     2022-01-03 76JJ-98-66 DE-1234(1) Approved 50    250
A     2022-02-02 17KL-10-43 DE-1234(2) Draft    12    400
C     2022-03-03 97EE-42-17 AE-2468    Approved 25    450
The output required: take the most recent item_id for each group; when it is involved in the change process and its status is not Approved, also take the most recent item_id for that group that has been approved.
Also note that the approved record won't necessarily be the second most recent record per group; it can be further back in the timeline.
group date       item_id    item_name  status   price stock original_item_id original_item_name original_status original_price original_stock
----- ---------- ---------- ---------- -------- ----- ----- ---------------- ------------------ --------------- -------------- --------------
A     2022-02-02 17KL-10-43 DE-1234(2) Draft    12    400   76JJ-98-66       DE-1234(1)         Approved        50             250
B     2022-01-05 28AE-23-67 CD-4567(1) Draft    35    200   45AB-16-77       CD-4567            Approved        30             120
C     2022-03-03 97EE-42-17 AE-2468    Approved 25    450   NULL             NULL               NULL            NULL           NULL
Your example output for group A shows the original item name (the most recent approved item name for that group) as DE-1234(1), which has a date of 2022-01-03; however, item name DE-1234 has a date of 2022-01-05, making it the most recent approved item id in group A. Because of that, my output differs from yours for that record.
Here is a link to the SQL Fiddle where I recreated this.
Here is the query I created for this:
First we create a CTE that ranks your items by group to get the most recent per group.
WITH cte AS  --rank records by group, ordered by date DESC
(
    SELECT
        [group]
        ,[date]
        ,item_id
        ,item_name
        ,status
        ,price
        ,stock
        ,ROW_NUMBER() OVER (PARTITION BY [group] ORDER BY [date] DESC) AS rn
    FROM t
)
Then we left join the CTE to itself on each group's earlier approved records and re-rank to get the most recently approved record per group.
,cte2 AS  --join each group's most recent record to its earlier approved records
(
    SELECT
        a.[group]
        ,a.[date]
        ,a.item_id
        ,a.item_name
        ,a.status
        ,a.price
        ,a.stock
        ,b.[group] AS original_group
        ,b.[date] AS original_date
        ,b.item_id AS original_item_id
        ,b.item_name AS original_item_name
        ,b.status AS original_status
        ,b.price AS original_price
        ,b.stock AS original_stock
        ,ROW_NUMBER() OVER (PARTITION BY a.[group] ORDER BY b.rn) AS rn  --rank the approved records, most recent first
    FROM cte a
    LEFT JOIN cte b ON
        a.[group] = b.[group]
        AND b.rn > a.rn  --b is an earlier record
        AND b.status = 'Approved'
    WHERE a.rn = 1  --most recent item per group
)
Lastly, we query cte2, filtering for only the most recent record that was approved:
SELECT
    [group]
    ,[date]
    ,item_id
    ,item_name
    ,status
    ,price
    ,stock
    --,original_group
    --,original_date
    ,original_item_id
    ,original_item_name
    ,original_status
    ,original_price
    ,original_stock
FROM cte2
WHERE rn = 1  --filter for the most recent approved record
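As a sanity check, the two CTEs can be replayed on an in-memory SQLite database (a sketch, not SQL Server: [bracket] quoting becomes double quotes, and only the id columns are kept for brevity). It reproduces the result described in the answer, including the group A difference noted above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE t ("group" TEXT, "date" TEXT, item_id TEXT,
                               item_name TEXT, status TEXT, price INT, stock INT)""")
con.executemany("INSERT INTO t VALUES (?,?,?,?,?,?,?)", [
    ("A", "2022-01-01", "36FG-34-45", "AB-1234",    "Draft",    15, 100),
    ("B", "2022-01-02", "28AE-23-67", "CD-4567",    "Approved", 30, 120),
    ("A", "2022-01-05", "45RE-12-99", "DE-1234",    "Approved", 20, 300),
    ("C", "2022-01-07", "78ED-14-88", "EA-4532",    "Draft",    10, 500),
    ("B", "2022-01-05", "45AB-16-77", "CD-4567(1)", "Draft",    35, 200),
    ("A", "2022-01-03", "76JJ-98-66", "DE-1234(1)", "Approved", 50, 250),
    ("A", "2022-02-02", "17KL-10-43", "DE-1234(2)", "Draft",    12, 400),
    ("C", "2022-03-03", "97EE-42-17", "AE-2468",    "Approved", 25, 450),
])

rows = con.execute("""
    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY "group" ORDER BY "date" DESC) AS rn
        FROM t
    ),
    cte2 AS (
        SELECT a."group" AS grp,
               a.item_id,
               b.item_id AS original_item_id,
               ROW_NUMBER() OVER (PARTITION BY a."group" ORDER BY b.rn) AS rn2
        FROM cte a
        LEFT JOIN cte b
          ON a."group" = b."group" AND b.rn > a.rn AND b.status = 'Approved'
        WHERE a.rn = 1               -- most recent item per group
    )
    SELECT grp, item_id, original_item_id
    FROM cte2
    WHERE rn2 = 1                    -- most recent approved record
    ORDER BY grp
""").fetchall()
```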

Computing window functions for multiple dates

I have a table sales consisting of user ids, the products those users have purchased, and the date of purchase:
date       user_id product
---------- ------- -------
2021-01-01 1       apple
2021-01-02 1       orange
2021-01-02 2       apple
2021-01-02 3       apple
2021-01-03 3       orange
2021-01-04 4       apple
If I wanted to see product counts based on every user's most recent purchase, I would do something like this:
WITH latest_sales AS (
    SELECT
        date
        , user_id
        , product
        , row_number() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn
    FROM sales
)
SELECT
    product
    , count(1) AS count
FROM latest_sales
WHERE rn = 1
GROUP BY product
Producing:
product count
------- -----
apple   2
orange  2
However, this only produces results as of the most recent date. If I had run this on 2021-01-02, the results would have been:
product count
------- -----
apple   2
orange  1
How could I code this so I could see counts of the most recent products purchased by user, but for multiple dates?
So the output would be something like this:
date       product count
---------- ------- -----
2021-01-01 apple   1
2021-01-01 orange  0
2021-01-02 apple   2
2021-01-02 orange  1
2021-01-03 apple   1
2021-01-03 orange  2
2021-01-04 apple   2
2021-01-04 orange  2
Appreciate any help on this.
I'm afraid the window function row_number() with the PARTITION BY user_id clause is not sufficient in your case, because it only considers the user_id of the current row, whereas you want a consolidated view across all the users.
I don't have a better idea than a self-join on table sales:
WITH list AS (
    SELECT DISTINCT ON (s2.date, user_id)
        s2.date
        , product
    FROM sales AS s1
    INNER JOIN (SELECT DISTINCT date FROM sales) AS s2
        ON s1.date <= s2.date
    ORDER BY s2.date, user_id, s1.date DESC
)
SELECT date, product, count(*)
FROM list
GROUP BY date, product
ORDER BY date
see the test result in dbfiddle
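The same per-date roll-up can be stated in plain Python (a sketch using the question's sample rows; like the SQL above, it omits zero counts such as orange on 2021-01-01):

```python
from collections import Counter

# Sample rows from the question as (date, user_id, product) tuples.
sales = [
    ("2021-01-01", 1, "apple"),
    ("2021-01-02", 1, "orange"),
    ("2021-01-02", 2, "apple"),
    ("2021-01-02", 3, "apple"),
    ("2021-01-03", 3, "orange"),
    ("2021-01-04", 4, "apple"),
]

def counts_as_of(day):
    # For each user, keep the product of their most recent purchase on or
    # before `day` (iterating in ascending date order, the last write wins),
    # then count products across users.
    latest = {}
    for date, user, product in sorted(sales):
        if date <= day:
            latest[user] = product
    return Counter(latest.values())
```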

Count Distinct IDs in a date range given a start and end time

I have a BigQuery table like this
id start_date end_date   location type
-- ---------- ---------- -------- ------
1  2022-01-01 2022-01-01 MO       mobile
1  2022-01-01 2022-01-02 MO       mobile
2  2022-01-02 2022-01-03 AZ       laptop
3  2022-01-03 2022-01-03 AZ       mobile
3  2022-01-03 2022-01-03 AZ       mobile
3  2022-01-03 2022-01-03 AZ       mobile
2  2022-01-02 2022-01-03 CA       laptop
4  2022-01-02 2022-01-03 CA       mobile
5  2022-01-02 2022-01-03 CA       laptop
I want to return the number of unique ids by location and type for an arbitrary date range.
The issue I have is that there are multiple repeated rows covering similar dates, like the first two rows above.
For example, a date range of 2022-01-02 to 2022-01-03 would return:
location type   count distinct ID
-------- ------ -----------------
AZ       laptop 1
AZ       mobile 1
CA       laptop 2
CA       mobile 1
MO       mobile 1
I first tried creating a list of dates, like:
WITH dates AS (
    SELECT *
    FROM UNNEST(GENERATE_DATE_ARRAY(
        DATE_TRUNC(DATE_SUB(CURRENT_DATE('Europe/London'), INTERVAL 3 MONTH), MONTH),
        DATE_SUB(CURRENT_DATE('Europe/London'), INTERVAL 1 DAY),
        INTERVAL 1 DAY)) AS cal_date
)
and using ROW_NUMBER() OVER (PARTITION BY id, start_date, end_date) to expose only unique rows.
But I was only able to return the number of unique ids for each day, rather than looking at the full date range as a whole.
I then tried joining to the same CTE to return a unique row for each date, so something like
date | id | start_date | end_date | location | type
where the columns from the first table above are duplicated for each date, but this would require generating a huge number of rows to then further work with.
What is the correct way to achieve the desired result?
I think the simplest way is
select location, type, count(distinct id) distinct_ids
from your_table, unnest(generate_date_array(start_date, end_date)) effective_date
where effective_date between '2022-01-02' and '2022-01-03'
group by location, type
which, applied to the sample data in your question, produces the expected output above.
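The unnest-and-filter idea can be checked in plain Python with the sample rows. Note that testing the generated dates against the range is equivalent to a simple interval-overlap test (a sketch, not BigQuery code):

```python
# Sample rows from the question as (id, start_date, end_date, location, type).
rows = [
    (1, "2022-01-01", "2022-01-01", "MO", "mobile"),
    (1, "2022-01-01", "2022-01-02", "MO", "mobile"),
    (2, "2022-01-02", "2022-01-03", "AZ", "laptop"),
    (3, "2022-01-03", "2022-01-03", "AZ", "mobile"),
    (3, "2022-01-03", "2022-01-03", "AZ", "mobile"),
    (3, "2022-01-03", "2022-01-03", "AZ", "mobile"),
    (2, "2022-01-02", "2022-01-03", "CA", "laptop"),
    (4, "2022-01-02", "2022-01-03", "CA", "mobile"),
    (5, "2022-01-02", "2022-01-03", "CA", "laptop"),
]

def distinct_ids(range_start, range_end):
    # (location, type) -> set of ids active inside the range. Sets make
    # repeated rows (like the duplicated id 3 rows) count only once.
    seen = {}
    for rid, start, end, loc, typ in rows:
        # A row is "active" if [start, end] overlaps [range_start, range_end];
        # this is exactly when some generated date falls inside the range.
        # ISO date strings compare correctly as plain strings.
        if start <= range_end and end >= range_start:
            seen.setdefault((loc, typ), set()).add(rid)
    return {k: len(v) for k, v in sorted(seen.items())}
```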

How to filter out multiple downtime events in SQL Server?

There is a query I need to write that filters out multiples of the same downtime event. These records get created at the exact same time with multiple different time stealers, which I don't need. Also, in the event of multiple time stealers for one downtime event, I need to make the TimeStealer column NULL instead.
Example table:
Id TimeStealer Start               End                 Is_Downtime Downtime_Event
-- ----------- ------------------- ------------------- ----------- --------------
1  Machine 1   2022-01-01 01:00:00 2022-01-01 01:01:00 1           Malfunction
2  Machine 2   2022-01-01 01:00:00 2022-01-01 01:01:00 1           Malfunction
3  NULL        2022-01-01 00:01:00 2022-01-01 00:59:59 0           Operating
What I need the query to return:
Id TimeStealer Start               End                 Is_Downtime Downtime_Event
-- ----------- ------------------- ------------------- ----------- --------------
1  NULL        2022-01-01 01:00:00 2022-01-01 01:01:00 1           Malfunction
2  NULL        2022-01-01 00:01:00 2022-01-01 00:59:59 0           Operating
This seems like a top-1-row-per-group problem, but with the added logic of making a column NULL when the group has multiple rows. You can achieve that by also using a windowed COUNT, and then a CASE expression in the outer SELECT that only returns the value of TimeStealer when there was exactly 1 event:
WITH CTE AS (
    SELECT V.Id,
           V.TimeStealer,
           V.Start,
           V.[End],
           V.Is_Downtime,
           V.Downtime_Event,
           ROW_NUMBER() OVER (PARTITION BY V.Start, V.[End], V.Is_Downtime, V.Downtime_Event ORDER BY V.Id) AS RN,
           COUNT(V.Id) OVER (PARTITION BY V.Start, V.[End], V.Is_Downtime, V.Downtime_Event) AS Events
    FROM (VALUES ('1', 'Machine 1', CONVERT(datetime2(0), '2022-01-01 01:00:00'), CONVERT(datetime2(0), '2022-01-01 01:01:00'), '1', 'Malfunction'),
                 ('2', 'Machine 2', CONVERT(datetime2(0), '2022-01-01 01:00:00'), CONVERT(datetime2(0), '2022-01-01 01:01:00'), '1', 'Malfunction'),
                 ('3', NULL,        CONVERT(datetime2(0), '2022-01-01 00:01:00'), CONVERT(datetime2(0), '2022-01-01 00:59:59'), '0', 'Operating')) V (Id, TimeStealer, [Start], [End], Is_Downtime, Downtime_Event)
)
SELECT ROW_NUMBER() OVER (ORDER BY C.Id) AS Id,
       CASE WHEN C.Events = 1 THEN C.TimeStealer END AS TimeStealer,
       C.Start,
       C.[End],
       C.Is_Downtime,
       C.Downtime_Event
FROM CTE C
WHERE C.RN = 1;
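The same shape of query runs almost unchanged on SQLite for quick verification (a sketch, not T-SQL: [bracket] quoting becomes double quotes, CONVERT is dropped since the timestamps are stored as text, and a trailing ORDER BY is added to fix the output order):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE events (Id INT, TimeStealer TEXT, Start TEXT,
                                    "End" TEXT, Is_Downtime INT, Downtime_Event TEXT)""")
con.executemany("INSERT INTO events VALUES (?,?,?,?,?,?)", [
    (1, "Machine 1", "2022-01-01 01:00:00", "2022-01-01 01:01:00", 1, "Malfunction"),
    (2, "Machine 2", "2022-01-01 01:00:00", "2022-01-01 01:01:00", 1, "Malfunction"),
    (3, None,        "2022-01-01 00:01:00", "2022-01-01 00:59:59", 0, "Operating"),
])

rows = con.execute("""
    WITH cte AS (
        SELECT Id, TimeStealer, Start, "End", Is_Downtime, Downtime_Event,
               ROW_NUMBER() OVER (PARTITION BY Start, "End", Is_Downtime, Downtime_Event
                                  ORDER BY Id) AS rn,
               COUNT(Id) OVER (PARTITION BY Start, "End", Is_Downtime, Downtime_Event) AS events
        FROM events
    )
    SELECT ROW_NUMBER() OVER (ORDER BY Id) AS Id,
           CASE WHEN events = 1 THEN TimeStealer END AS TimeStealer,  -- NULL when duplicated
           Start, "End", Is_Downtime, Downtime_Event
    FROM cte
    WHERE rn = 1               -- keep one row per downtime event
    ORDER BY Id
""").fetchall()
```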

Add temporary column with number in sequence in BigQuery

I have two tables: customers and orders. orders has a customer_id column, so a customer can have many orders. I need to find each order's number in sequence (by date). The result should be something like this:
customer_id order_date number_in_sequence
----------- ---------- ------------------
1 2020-01-01 1
1 2020-01-02 2
1 2020-01-03 3
2 2019-01-01 1
2 2019-01-02 2
I am going to use it in a WITH clause, so I don't need to add it to the table.
You need row_number() :
select t.*,
row_number() over (partition by customer_id order by order_date) as number_in_sequence
from table t;
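A quick way to see the numbering in action is an in-memory SQLite table (a sketch with assumed columns customer_id and order_date; row_number() behaves the same way in BigQuery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INT, order_date TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-03"),
    (2, "2019-01-01"), (2, "2019-01-02"),
])

# Number each customer's orders by date: the count restarts at 1 for each
# customer because of PARTITION BY customer_id.
rows = con.execute("""
    SELECT customer_id, order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date) AS number_in_sequence
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()
```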