Replace value with last value where flag was set to 1 - sql

I have a table where each row contains all the fields that changed during some event, and, for each field, a flag indicating whether the field was updated. For simplicity I only show the "status" field here, but there are several other fields as well.
In cases where a given field was not modified by the event, the field is set to null and so is the flag.
+----+---------------------+--------+---------------------+
| id | date                | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1  | 2020-01-03 19:32:17 | TODO   | 1                   |
| 1  | 2020-01-08 15:46:07 | WIP    | 1                   |
| 1  | 2020-01-08 15:53:53 |        |                     | // this line was generated because another field changed
| 1  | 2020-01-08 15:56:53 |        |                     | // this line was generated because another field changed
| 1  | 2020-01-08 16:02:31 | Done   | 1                   |
+----+---------------------+--------+---------------------+
My goal is to replace the field values for the rows where the field was not changed with the last value it had when the flag was equal to 1, e.g. to get:
+----+---------------------+--------+---------------------+
| id | date                | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1  | 2020-01-03 19:32:17 | TODO   | 1                   |
| 1  | 2020-01-08 15:46:07 | WIP    | 1                   |
| 1  | 2020-01-08 15:53:53 | WIP    |                     |
| 1  | 2020-01-08 15:56:53 | WIP    |                     |
| 1  | 2020-01-08 16:02:31 | Done   | 1                   |
+----+---------------------+--------+---------------------+
I understand that I want to use the LAST_VALUE analytic function in BigQuery, and I tried:
SELECT id, date, status,
       LAST_VALUE(status) OVER (ORDER BY flag_changed_status, date
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS current_status,
       flag_changed_status
FROM table
ORDER BY id, date
The idea was that, by using the flag in the ORDER BY, the rows where the flag is null would be sorted first, so that LAST_VALUE(status) would return the last value where flag_changed_status was set to 1.
But this can only work if I use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, because the ORDER BY clause is processed before the window frame clause (ROWS BETWEEN ...). For the rows where flag_changed_status is null, once the ORDER BY has been applied they sit at the start of the ordering, so the last value between UNBOUNDED PRECEDING and CURRENT ROW is always null.
Is there any way to first run the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING and then the ORDER BY, so that LAST_VALUE(status) returns the last value preceding the current row where the flag was set to 1? Or is there something much simpler, still using analytic functions, that would allow me to complete all the different fields in one query?
Edit:
I really want to copy the status that was set the last time the flag was set, even if that status is null; that is why I am trying to use the flag in the ORDER BY. That is, if the initial table is:
+----+---------------------+--------+---------------------+
| id | date                | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1  | 2020-01-03 19:32:17 | TODO   | 1                   |
| 1  | 2020-01-08 15:46:07 | null   | 1                   |
| 1  | 2020-01-08 15:53:53 | null   | null                |
| 1  | 2020-01-08 15:56:53 | null   | null                |
| 1  | 2020-01-08 15:57:53 | WIP    | 1                   |
| 1  | 2020-01-08 15:58:53 | null   | null                |
| 1  | 2020-01-08 16:02:31 | Done   | 1                   |
+----+---------------------+--------+---------------------+
I would need:
+----+---------------------+--------+---------------------+
| id | date                | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1  | 2020-01-03 19:32:17 | TODO   | 1                   |
| 1  | 2020-01-08 15:46:07 | null   | 1                   |
| 1  | 2020-01-08 15:53:53 | null   | null                | // we copy the last status where the flag was 1, and it is null
| 1  | 2020-01-08 15:56:53 | null   | null                |
| 1  | 2020-01-08 15:57:53 | WIP    | 1                   |
| 1  | 2020-01-08 15:58:53 | WIP    | null                | // only this line changes
| 1  | 2020-01-08 16:02:31 | Done   | 1                   |
+----+---------------------+--------+---------------------+
But it seems to be too complicated, so I will just replace all the nulls where the flag is set to 1 with a custom status, and then a simple LAST_VALUE(status IGNORE NULLS), as @Gordon Linoff was suggesting, will provide almost the desired result.
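For reference, a minimal sketch of that workaround in BigQuery, assuming a sentinel string such as 'UNKNOWN' that can never occur as a real status:
SELECT * EXCEPT (patched_status),
  NULLIF(LAST_VALUE(patched_status IGNORE NULLS)
           OVER (PARTITION BY id ORDER BY `date`), 'UNKNOWN') AS status_filled
FROM (
  SELECT *,
    -- flagged rows keep their status, with null swapped for the sentinel,
    -- so IGNORE NULLS skips only the rows where nothing changed
    CASE WHEN flag_changed_status = 1 THEN IFNULL(status, 'UNKNOWN') END AS patched_status
  FROM `project.dataset.table`
)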

Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(grp),
  LAST_VALUE(status IGNORE NULLS) OVER (PARTITION BY grp ORDER BY `date`) AS updated_status
FROM (
  SELECT *,
    -- running count of flagged rows: each flagged row starts a new group,
    -- so within a group the last non-null status is the flagged one
    COUNTIF(flag_changed_status = 1) OVER (ORDER BY `date`) AS grp
  FROM `project.dataset.table`
)
Applied to the sample data from your question, the result is:
Row | id | date                | status | flag_changed_status | updated_status
----+----+---------------------+--------+---------------------+---------------
1   | 1  | 2020-01-03 19:32:17 | TODO   | 1                   | TODO
2   | 1  | 2020-01-08 15:46:07 | null   | 1                   | null
3   | 1  | 2020-01-08 15:53:53 | null   | null                | null
4   | 1  | 2020-01-08 15:56:53 | null   | null                | null
5   | 1  | 2020-01-08 15:57:53 | WIP    | 1                   | WIP
6   | 1  | 2020-01-08 15:58:53 | null   | null                | WIP
7   | 1  | 2020-01-08 16:02:31 | Done   | 1                   | Done
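Note that the sample data contains a single id; if the real table holds many ids, the grouping presumably needs to restart per id (an assumed variant of the same query):
SELECT * EXCEPT(grp),
  LAST_VALUE(status IGNORE NULLS) OVER (PARTITION BY id, grp ORDER BY `date`) AS updated_status
FROM (
  SELECT *,
    COUNTIF(flag_changed_status = 1) OVER (PARTITION BY id ORDER BY `date`) AS grp
  FROM `project.dataset.table`
)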

I prefer lag(ignore nulls). But BigQuery doesn't support that. Instead, use first_value()/last_value():
with t as (
    select 1 as id, '2020-01-03 19:32:17' as date, 'TODO' as status, 1 as flag_changed_status union all
    select 1 as id, '2020-01-08 15:46:07' as date, 'WIP' as status, 1 as flag_changed_status union all
    select 1 as id, '2020-01-08 15:53:53' as date, null as status, null as flag_changed_status union all
    select 1 as id, '2020-01-08 15:56:53' as date, null as status, null as flag_changed_status union all
    select 1 as id, '2020-01-08 16:02:31' as date, 'Done' as status, 1 as flag_changed_status
)
select t.*,
    last_value(status ignore nulls) over (order by date) as imputed_status
from t;
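For what it's worth, on engines that do support IGNORE NULLS with LAG (Oracle, Redshift, and Snowflake, for instance; the keyword placement varies by dialect), a sketch of that preferred form would be:
select t.*,
    -- lag() excludes the current row, so coalesce() keeps a non-null current status
    coalesce(status, lag(status) ignore nulls over (order by date)) as imputed_status
from t;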

Related

How to use variable lag window functions?

I have a table with the following schema:
CREATE TABLE example (
    userID int,
    status varchar(10), -- 'SUCCESS' or 'FAIL'
    date   date         -- self explanatory
);
INSERT INTO example VALUES
    (123, 'SUCCESS', 20211010),
    (123, 'SUCCESS', 20211011),
    (123, 'SUCCESS', 20211028),
    (123, 'FAIL',    20211029),
    (123, 'SUCCESS', 20211105),
    (123, 'SUCCESS', 20211110);
I am trying to utilize a lag or lead function to assess whether the current line is within a 2-week window of the previous 'SUCCESS'. Given the current data, I would expect an isWithin2WeeksofSuccessFlag column like the following:
123, 'SUCCESS', 20211010, 0 -- since it is the first instance
123, 'SUCCESS', 20211011, 1
123, 'SUCCESS', 20211028, 1
123, 'FAIL',    20211029, 1 -- failed, but the criteria is that it is within 2 weeks of the last success, which it is
123, 'SUCCESS', 20211105, 1 -- last success is 2 rows back, but it is within 2 weeks
123, 'SUCCESS', 20211128, 0 -- outside of 2 weeks
I would initially think to do something like this:
select userID, status, date,
    case when lag(status, 1) over (partition by userID order by date asc) = 'SUCCESS'
          and date_add('day', -14, date) <= lag(date, 1) over (partition by userID order by date asc)
         then 1 end as isWithin2WeeksofSuccessFlag
from example
This would work if I didn't have the 'FAIL' line in there. To handle it, I could change the lag to 2 (instead of 1), but what if I have 2, 3, 4, n 'FAIL's in a row? I would need to lag by 3, 4, 5, n+1. The specific number of FAILs in between is variable. How do I handle this variability?
NOTE: I am querying billions of rows. Efficiency isn't really a concern (since this is for analysis), but running into memory allocation errors is. Thus, endlessly adding more window functions would likely cause automatic termination of the query, due to the memory requirement exceeding the node limit.
How should I handle this?
Here's an approach, also using window functions, with each "common table expression" handling one step at a time.
Note: The expected result in the question does not match the data in the question. '20211128' doesn't exist in the actual data. I used the example INSERT statement.
In the test case, I changed the column name to xdate to avoid any potential SQL reserved word issues.
The SQL:
WITH cte1 AS (
    SELECT *
        -- running count of SUCCESS rows: each SUCCESS row starts a new group
        , SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) OVER (PARTITION BY userID ORDER BY xdate) AS grp
    FROM example
)
, cte2 AS (
    SELECT *
        -- the date of the SUCCESS that opened the group (each group contains at most one)
        , MAX(CASE WHEN status = 'SUCCESS' THEN xdate END) OVER (PARTITION BY userID, grp) AS lastdate
    FROM cte1
)
, cte3 AS (
    SELECT *
        -- is the SUCCESS date carried by the previous row within 2 weeks of this row?
        , CASE WHEN LAG(lastdate) OVER (PARTITION BY userID ORDER BY xdate) > (xdate - INTERVAL '2' WEEK) THEN 1 ELSE 0 END AS isNear
    FROM cte2
)
SELECT * FROM cte3
ORDER BY userID, xdate
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-10 | 5 | 2021-11-10 | 1 |
+--------+---------+------------+------+------------+--------+
And with the data adjusted to match your expected result, plus a new user introduced, the result is this:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
| 323 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 323 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 323 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 323 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 323 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 323 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
+--------+---------+------------+------+------------+--------+
Here's an extra test case, which might expose problems in some solutions:
INSERT INTO example VALUES
(123, 'SUCCESS', '2021-10-11')
, (123, 'FAIL' , '2021-10-12')
, (123, 'FAIL' , '2021-10-13')
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-11 | 1 | 2021-10-11 | 0 |
| 123 | FAIL | 2021-10-12 | 1 | 2021-10-11 | 1 |
| 123 | FAIL | 2021-10-13 | 1 | 2021-10-11 | 1 |
+--------+---------+------------+------+------------+--------+
If your DBMS doesn't support window function filters, you can order by status desc so that 'SUCCESS' sorts before 'FAIL'.
select userID, status, date,
    case when lag(status, 1) over (partition by userID order by status desc, date asc) = 'SUCCESS'
          and dateadd(d, -14, date) <= lag(date, 1) over (partition by userID order by status desc, date asc)
         then 1 end as isWithin2WeeksofSuccessFlag
from example
order by date
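Conversely, where window aggregate filters are supported (PostgreSQL, for example), a sketch that carries the most recent preceding SUCCESS date along with each row might look like this, assuming date is a real date column:
select userID, status, date,
    case when max(date) filter (where status = 'SUCCESS')
              over (partition by userID
                    order by date
                    -- the frame ends at the previous row, so the first row per user yields null -> 0
                    rows between unbounded preceding and 1 preceding)
              >= date - interval '14 days'
         then 1 else 0 end as isWithin2WeeksofSuccessFlag
from example;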

Adding indicator column to table based on having two consecutive days within group

I need to add logic that flags the first of two consecutive days as 1 and the second day as 0, grouped by a column (test). If a test (a) has three consecutive days, then the third should start with 1 again, etc.
An example table would be like the following, with "new col" being the column I need.
| test | test_date | new col |
|------|-----------|---------|
| a    | 1/1/2020  | 1       |
| a    | 1/2/2020  | 0       |
| a    | 1/3/2020  | 1       |
| b    | 1/1/2020  | 1       |
| b    | 1/2/2020  | 0       |
| b    | 1/15/2020 | 1       |
This seems to be a gaps-and-islands problem, and I assume some window function approach should get me there.
I tried something like the following to get the consecutive part, but I struggle with the indicator column.
select
    test,
    test_date,
    grp_var = dateadd(day,
        -row_number() over (partition by test order by test_date), test_date)
from
    my_table
This does read as a gaps-and-islands problem. I would recommend using the difference between row_number() and the date to generate the groups, and then arithmetic:
select
    test,
    test_date,
    -- position within the island; % 2 maps the 1st, 3rd, ... day to 1 and the 2nd, 4th, ... to 0
    row_number() over(
        partition by test, dateadd(day, -rn, test_date) -- constant within each island of consecutive days
        order by test_date
    ) % 2 as new_col
from (
    select
        t.*,
        row_number() over(partition by test order by test_date) as rn
    from mytable t
) t
Demo on DB Fiddle:
test | test_date | new_col
:--- | :--------- | ------:
a | 2020-01-01 | 1
a | 2020-01-02 | 0
a | 2020-01-03 | 1
b | 2020-01-01 | 1
b | 2020-01-02 | 0
b | 2020-01-15 | 1

How to select the latest date for each group by number?

I've been stuck on this question for a while, and I was wondering if the community could point me in the right direction.
I have some tag IDs that need to be grouped, with exceptions (column: deleted) that need to be retained in the results. After that, for each grouped tag ID, I need to select the one with the latest date. How can I do this? An example below:
ID | TAG_ID | DATE     | DELETED
1  | 300    | 05/01/20 | null
2  | 300    | 03/01/20 | 04/01/20
3  | 400    | 06/01/20 | null
4  | 400    | 05/01/20 | null
5  | 400    | 04/01/20 | null
6  | 500    | 03/01/20 | null
7  | 500    | 02/01/20 | null
I am trying to reach this outcome:
ID | TAG_ID | DATE     | DELETED
1  | 300    | 05/01/20 | null
2  | 300    | 03/01/20 | 04/01/20
3  | 400    | 06/01/20 | null
6  | 500    | 03/01/20 | null
So, firstly if there is a date in the "DELETED" column, I would like the row to be present. Secondly, for each unique tag ID, I would like the row with the latest "DATE" to be present.
Hopefully this question is clear. Would appreciate your feedback and help! A big thanks in advance.
Your results seem to be something like this:
select t.*
from (select t.*,
             row_number() over (partition by tag_id, deleted order by date desc) as seqnum
      from t
     ) t
where seqnum = 1 or deleted is not null;
This takes one row where deleted is null -- the most recent row. It also keeps each row where deleted is not null.
You need 2 conditions combined with OR in the WHERE clause:
the 1st: deleted is not null, or
the 2nd: there isn't any other row with the same tag_id and a date later than the current row's date, meaning that the current row's date is the latest:
select t.*
from tablename t
where t.deleted is not null
   or not exists (
        select 1 from tablename
        where tag_id = t.tag_id and date > t.date
      )
Results:
| id | tag_id | date | deleted |
| --- | ------ | ---------- | -------- |
| 1 | 300 | 2020-05-01 | |
| 2 | 300 | 2020-03-01 | 04/01/20 |
| 3 | 400 | 2020-06-01 | |
| 6 | 500 | 2020-03-01 | |
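A possible alternative sketch that checks the same two conditions with a window MAX instead of NOT EXISTS (column and table names as above):
select id, tag_id, date, deleted
from (
    select t.*,
           -- latest date per tag: a row is kept if it is the latest, or if it was deleted
           max(date) over (partition by tag_id) as max_date
    from tablename t
) t
where deleted is not null
   or date = max_date;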

How do I get the latest user-updated column value in a table, based on a timestamp entry in a different table, in SQL Server?

I have a temp table #StatusInfo with the following data
+---------+--------------+-------+-------------------------+
| OrderNo | GroupLineNum | Type1 | UpdateDate              |
+---------+--------------+-------+-------------------------+
| Order85 | NULL         | 1     | 2019-11-25 05:15:55.000 |
| Order86 | NULL         | 1     | 2019-11-25 05:15:55.000 |
| Order86 | 2            | 2     | 2019-11-25 05:32:23.773 |
| Order87 | NULL         | 1     | 2019-11-25 05:15:55.000 |
| Order87 | 1            | 2     | 2019-11-25 05:43:37.637 | // B
| Order87 | 2            | 2     | 2019-11-25 05:42:32.390 | // A
| Order88 | NULL         | 1     | 2019-11-25 06:35:13.000 |
| Order88 | 1            | 2     | 2019-11-25 06:39:16.170 |
+---------+--------------+-------+-------------------------+
Any update the user makes on an order will be pulled into this temp table. A Type1 value of 2 denotes a 'Required Date' field change by the user; the timestamp of the change is in the last column.
I have another temp table #LineInfo with the following data. This table is created by joining other tables, including a left join with the table above. The LineNum column in the table below matches the GroupLineNum column in the table above for Type1 = 2.
+---------+-----------+---------+------------+-------------------------+-------+
| OrderNo | RowNumber | LineNum | TotalCost  | ReqDate                 | Type1 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order85 | 1         | 1       | 309.110000 | 2019-10-30 23:59:00.000 | 1     |
| Order85 | 2         | 2       | 265.560000 | 2019-10-30 23:59:00.000 | 1     |
| Order86 | 1         | 1       | 309.110000 | 2019-10-30 23:59:00.000 | 1     |
| Order86 | 2         | 2       | 265.560000 | 2019-12-28 23:59:00.000 | 2     |
| Order87 | 1         | 1       | 309.110000 | 2020-01-31 23:59:00.000 | 2     |
| Order87 | 2         | 2       | 265.560000 | 2020-01-01 23:59:00.000 | 2     |
| Order88 | 1         | 1       | 309.110000 | 2019-11-29 23:59:00.000 | 2     |
| Order88 | 2         | 2       | 265.560000 | 2019-12-31 23:59:00.000 | 2     |
+---------+-----------+---------+------------+-------------------------+-------+
I will be joining #LineInfo with other tables to generate a new table with only one record per OrderNo, grouped by OrderNo.
What I need to do is ensure that the new select query has a column ReqDate holding the latest ReqDate value for the order.
For example, Order87 has two lines in the order. The user updated Line 2 first, at '2019-11-25 05:42:32.390', as seen in the row marked 'A', followed by Line 1, marked 'B', at '2019-11-25 05:43:37.637' in the first table.
The new query should have the data from #LineInfo, with only the ReqDate value matching the LineNum that has the maximum UpdateDate for Type1 = 2, grouped by OrderNo.
So in our example, the output should have the ReqDate value '2020-01-31 23:59:00.000'.
In short, an order should have the most recently updated required date. An order can have multiple line items where ReqDate is updated. If there is no entry in the #StatusInfo table with Type1 = 2 for an order, then any one of the ReqDate values from the #LineInfo table will suffice; the first line, say.
I wrote something like this, but it doesn't pull orders without any entry in the #StatusInfo table. Those orders will have a default value even though the user didn't update anything, and I am not sure how to join the result of this with the #LineInfo table to set the latest value:
SELECT SIT.OrderNo, max_date, GroupLineNum
FROM #StatusInfo SIT
INNER JOIN
    (SELECT OrderNo, MAX(UpdateDate) AS max_date
     FROM #StatusInfo SI
     WHERE SI.Type1 = 2
     GROUP BY SI.OrderNo) a
    ON a.OrderNo = SIT.OrderNo AND a.max_date = SIT.UpdateDate
This is what I did. I created the below CTE to load orders with a required-date change, in order of update date, and assigned a row number. The record with row number 1 will be the most recently updated date:
;WITH cteLatestReqDate AS (
    -- We need to pull the latest ReqDate value the user set, so we order the #StatusInfo
    -- rows by UpdateDate, assign a row number, and carry the respective line's required date
    SELECT SIT.OrderNo, SIT.UpdateDate, SIT.GroupLineNum, LLI.ReqDate,
           ROW_NUMBER() OVER (PARTITION BY SIT.OrderNo ORDER BY SIT.UpdateDate DESC) AS RowNum
    FROM #StatusInfo SIT
    INNER JOIN #LineLevelInfo LLI
        ON SIT.OrderNo = LLI.OrderNo AND SIT.GroupLineNum = LLI.LineNum
    WHERE SIT.Type1 = 2
)
and then I added the below condition to my select query (the select query below is partial):
SELECT
    CASE WHEN MAX(LRD.ReqDate) IS NULL
         THEN CAST(FORMAT(MAX(LLI.ReqDate), 'yyMMdd') AS NVARCHAR(10))
         ELSE CAST(FORMAT(MAX(LRD.ReqDate), 'yyMMdd') AS NVARCHAR(10))
    END AS LatestReqDate
FROM #LineLevelInfo LLI
LEFT JOIN (SELECT * FROM cteLatestReqDate WHERE RowNum = 1) LRD
    ON LRD.OrderNo = LLI.OrderNo AND LRD.GroupLineNum = LLI.LineNum
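Put together, a minimal end-to-end sketch of the whole approach (assuming the column names shown above, grouping by OrderNo, and using COALESCE in place of the CASE/IS NULL test) might look like:
;WITH cteLatestReqDate AS (
    SELECT SIT.OrderNo, LLI.ReqDate,
           ROW_NUMBER() OVER (PARTITION BY SIT.OrderNo
                              ORDER BY SIT.UpdateDate DESC) AS RowNum
    FROM #StatusInfo SIT
    INNER JOIN #LineLevelInfo LLI
        ON SIT.OrderNo = LLI.OrderNo AND SIT.GroupLineNum = LLI.LineNum
    WHERE SIT.Type1 = 2
)
SELECT LLI.OrderNo,
       -- fall back to any line's ReqDate when the order has no Type1 = 2 entry
       COALESCE(MAX(LRD.ReqDate), MAX(LLI.ReqDate)) AS LatestReqDate
FROM #LineLevelInfo LLI
LEFT JOIN cteLatestReqDate LRD
    ON LRD.OrderNo = LLI.OrderNo AND LRD.RowNum = 1
GROUP BY LLI.OrderNo;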

In Redshift, how do I run the opposite of a SUM function

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift that assigns session numbers to each user's sessions. It starts at 1 for the first record for each user, and, moving down, it increments whenever it encounters true in the is_new_session column and stays the same when it encounters false. When it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, you might need the windowing clause:
select t.*,
sum(case when is_new_session then 1 else 0 end) over
(partition by user_id
order by date
rows between unbounded preceding and current row
) as session_number
from t;
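As a quick, self-contained sanity check (a sketch: the inline rows mirror the sample table, order_id is added as a tie-breaker for same-day rows, and the date/boolean literals may need tweaking per engine):
with t as (
    select date '2014-09-01' as date, 'A' as user_id, 1 as order_id, true as is_new_session union all
    select date '2014-09-01', 'A', 5, false union all
    select date '2014-09-02', 'A', 8, true union all
    select date '2014-09-01', 'B', 2, true union all
    select date '2014-09-02', 'B', 3, true union all
    select date '2014-09-03', 'B', 4, true union all
    select date '2014-09-04', 'B', 6, true union all
    select date '2014-09-04', 'B', 7, false union all
    select date '2014-09-05', 'B', 9, true union all
    select date '2014-09-05', 'B', 10, false
)
select t.*,
    sum(case when is_new_session then 1 else 0 end) over
        (partition by user_id
         order by date, order_id
         rows between unbounded preceding and current row
        ) as session_number
from t
order by user_id, date, order_id;
-- expected session_number: 1, 1, 2 for user A and 1, 2, 3, 4, 4, 5, 5 for user B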