bigquery get highest possible steps group by col - sql

i have question about counting row number based on a column iteration
my table looks like this
time | steps | name
13:02 | 0 | a
13:03 | 0 | a
13:04 | 1 | a
13:05 | 0 | a
13:07 | 1 | a
13:10 | 1 | a
13:12 | 2 | a
13:04 | 0 | b
13:06 | 0 | b
13:12 | 1 | b
13:14 | 2 | b
13:19 | 3 | b
13:14 | 0 | b
13:19 | 3 | b
from table above i want to get the highest possible steps made by name. but must meet these condition:
steps made by name must be sequential(ex: 0,1,2,3 return 0,1,2,3; 0,1,2,4 return 0,1,2)
each step must be sequential according to time
Select any value if there are more than 1 record is possible(ex: 0,1,1,2 return 0,ANY(1,1),2)
table i looking for is
time | steps | name
13:05 | 0 | a
13:07 | 1 | a
13:12 | 2 | a
13:06 | 0 | b
13:12 | 1 | b
13:14 | 2 | b
13:19 | 3 | b
Is there any way i can do this in bigquery?

First remove duplicates. Then identify the rows where the "next" step (by time) is what you expect.
The following almost works:
select t.*
from (select min(time) as time, steps, name,
lead(steps) over (partition by name order by min(time)) as next_step
from yourtable t
group by steps, name
) t
where next_step = step + 1;
However, you want the minimum set. For that, you also need for the row number to match. It turns out that that condition is sufficient:
select t.*
from (select min(time) as time, steps, name,
row_number() over (partition by name order by min(time)) as seqnum
from yourtable t
group by steps, name
) t
where step = seqnum - 1;

Related

BigQuery: delete duplicated row that are not fully duplicated (delete desire row)

I have a table recording customer step on daily basis. The table had Id, date and step column. Some rows contained different steps on the same day for the same Id. Sample as shown below on 5/3/2020 and 5/4/2020 for Id 1:
| Id | Date | Step |
|:-----|:---------|:-----|
| 1 | 5/1/2020 | 1 |
| 1 | 5/2/2020 | 1 |
| 1 | 5/3/2020 | 0 |
| 1 | 5/3/2020 | 5 |
| 1 | 5/4/2020 | 2 |
| 1 | 5/4/2020 | 10 |
| 1 | 5/5/2020 | 1 |
| 2 | 5/1/2020 | 1 |
| 2 | 5/2/2020 | 2 |
| 2 | 5/3/2020 | 0 |
I want to delete rows that contain lesser step, which is 5/3/2020 for 0 step, 5/4/2020 for 2 step for Id 1.
I had tried using row_number() like this:
SELECT
Id,
Date,
step,
ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
`dataset.step`
WHERE rn>1
But that will give me rows with higher step, which is not want I want.
I also able to select rows with fewer step like this:
SELECT * FROM
`dataset.step` AS A
INNER JOIN
`dataset.step` AS B
ON A.Id = B.Id
AND A.Date = B.Date
WHERE A.step < B.step
But find no way to use it for delete.
Use below approach
select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)
if applied to sample data in your question - output is

How to find longest subsequence based on conditions in Impala SQL

I have a SQL table on Impala that contains ID, dt (monthly basis with no skipped month), and status of each person ID. I want to check how long that each ID is in each status (my expected answer is shown on expected column)
I tried to solve this problem on the value column by using
count(status) over (partition by ID, status order by dt)
but it doesn't reset the value when the status is changed.
+------+------------+--------+-------+----------+
| ID | dt | status | value | expected |
+------+------------+--------+-------+----------+
| 0001 | 01/01/2020 | 0 | 1 | 1 |
| 0001 | 01/02/2020 | 0 | 2 | 2 |
| 0001 | 01/03/2020 | 1 | 1 | 1 |
| 0001 | 01/04/2020 | 1 | 2 | 2 |
| 0001 | 01/05/2020 | 1 | 3 | 3 |
| 0001 | 01/06/2020 | 0 | 3 | 1 |
| 0001 | 01/07/2020 | 1 | 4 | 1 |
| 0001 | 01/08/2020 | 1 | 5 | 2 |
+------+------------+--------+-------+----------+
Is there anyway to reset the counter when the status is changed?
When you partition by ID and status, two groups are formed for the values 0 and 1 in status field. So, the months 1, 2, 6 go into first group with 0 status and the months 3, 4, 5, 7, 8 go into the second group with 1 status. Then, the count function counts the number of statuses individually in those groups. Thus the first group has counts from 1 to 3 and the second group has counts from 1 to 5. This query so far doesn't account for the change in statuses rather just simply divide the record set as per different status values.
One approach would be to divide the records into different blocks where each status change starts a new block. The below query follows this approach and gives the expected result:
SELECT ID,dt,status,
COUNT(status) OVER(PARTITION BY ID,block_number ORDER BY dt) as value
FROM (
SELECT ID,dt,status,
SUM(change_in_status) OVER(PARTITION BY ID ORDER BY dt) as block_number
FROM(
SELECT ID,dt,status,
CASE WHEN
status<>LAG(status) OVER(PARTITION BY ID ORDER BY dt)
OR LAG(status) OVER(PARTITION BY ID ORDER BY dt) IS NULL
THEN 1
ELSE 0
END as change_in_status
FROM statuses
) derive_status_changes
) derive_blocks;
Here is a working example in DB Fiddle.

How do i get the latest user udpated column value in a table based on timestamp entry on a different table in SQL Server?

I have a temp table #StatusInfo with the following data
+---------+--------------+-------+-------------------------+--+
| OrderNo | GroupLineNum | Type1 | UpdateDate | |
+---------+--------------+-------+-------------------------+--+
| Order85 | NULL | 1 | 2019-11-25 05:15:55.000 | |
+---------+--------------+-------+-------------------------+--+
| Order86 | NULL | 1 | 2019-11-25 05:15:55.000 | |
+---------+--------------+-------+-------------------------+--+
| Order86 | 2 | 2 | 2019-11-25 05:32:23.773 | |
+---------+--------------+-------+-------------------------+--+
| Order87 | NULL | 1 | 2019-11-25 05:15:55.000 | |
+---------+--------------+-------+-------------------------+--+
| Order87 | 1 | 2 | 2019-11-25 05:43:37.637 | | B
+---------+--------------+-------+-------------------------+--+
| Order87 | 2 | 2 | 2019-11-25 05:42:32.390 | | A
+---------+--------------+-------+-------------------------+--+
| Order88 | NULL | 1 | 2019-11-25 06:35:13.000 | |
+---------+--------------+-------+-------------------------+--+
| Order88 | 1 | 2 | 2019-11-25 06:39:16.170 | |
+---------+--------------+-------+-------------------------+--+
Any update the user does on an order will be pulled into this temp table. Type 1 column with value 2 denotes a 'Required Date' field change by the user. The timestamp when the user made the change is the last column.
I have another temp table #LineInfo with the following data. This table is created by joining other tables and a left join with the above table too. The 'LineNum' column from below table will match the 'GroupLineNum' column in the above table for Type1=2
+---------+-----------+---------+------------+-------------------------+-------+
| OrderNo | RowNumber | LineNum | TotalCost | ReqDate | Type1 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order85 | 1 | 1 | 309.110000 | 2019-10-30 23:59:00.000 | 1 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order85 | 2 | 2 | 265.560000 | 2019-10-30 23:59:00.000 | 1 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order86 | 1 | 1 | 309.110000 | 2019-10-30 23:59:00.000 | 1 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order86 | 2 | 2 | 265.560000 | 2019-12-28 23:59:00.000 | 2 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order87 | 1 | 1 | 309.110000 | 2020-01-31 23:59:00.000 | 2 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order87 | 2 | 2 | 265.560000 | 2020-01-01 23:59:00.000 | 2 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order88 | 1 | 1 | 309.110000 | 2019-11-29 23:59:00.000 | 2 |
+---------+-----------+---------+------------+-------------------------+-------+
| Order88 | 2 | 2 | 265.560000 | 2019-12-31 23:59:00.000 | 2 |
+---------+-----------+---------+------------+-------------------------+-------+
I will be joining #lineInfo with other tables to generate a new table with only one record for an orderno. Its grouped by orderno.
What I need to do is ensure that the new selectquery will have a column 'ReqDate' which will be the latest ReqDate value for the order.
For example, Order87 has two lines in the order. User updated Line 2 first at '2019-11-25 05:42:32.390' as seen in the row marked 'A' followed by Line 1 marked B # '2019-11-25 05:43:37.637 ' from the first table.
The new query should have the data from LineInfo and only the 'ReqDate' value matching the 'LineNum' that has the maximum of 'UpdateDate' column for Type1=2 and group by orderno.
So in our example, the output should have the ReqDate value '2020-01-31 23:59:00.000'.
In short, an order should have the most recently updated required date. Order can have multiple line items where reqdate is udpated. If there is no entry in #StatusInfo table with Type2 for an order, then any one of the ReqDate value from the #LineInfo table will suffice. Maybe the first line
I wrote something like this but it doesnt pull orders without any entry in StatusInfo table. Those orders will have a default value even though user didnt udpate and i am not sure how to join the result of this with LineInfo table to set the latest value
Select SIT.Orderno, max_date,grouplinenum
from #StatusInfo SIT
inner join
(SELECT Orderno, MAX(ActDate) as max_date
FROM #StatusInfo SI
WHERE SI.Type1=2
GROUP BY SI.Orderno)a
on a.Orderno = SIT.Orderno and a.max_date = SIT.ActDate
This is what I did. I created the blow CTE to load orders with req date change in order of Updated date and assigned it row number. Record with row number 1 will be the most recently updated date
;WITH cteLatestReqDate AS ( --We need to pull the latest ReqDate value the user set. So we are are ordering the SIT table by ActDate and assigning a row number and respective line's required date here
SELECT SIT.OrderNo, SIT.UpdateDate, SIT.GroupLineNum, LLI.ReqDate,
ROW_NUMBER() OVER (PARTITION BY SIT.OrderNo ORDER BY ActDate DESC) AS RowNum
FROM #StatusInfo SIT INNER JOIN #LineLevelInfo LLI ON SIT.OrderNo = OI.OrderNo AND SIT.GroupLineNum = LLI.LineNum
WHERE SIT.Type1 = 2
)
and then I added the below condition to my select query. Below select query is partial
SELECT
CASE WHEN MAX(LRD.ReqDate) IS NULL THEN CAST(FORMAT(MAX(LLI.ReqDate), 'yyMMdd') AS NVARCHAR(10))
ELSE CAST(FORMAT(MAX(LRD.ReqDate), 'yyMMdd') AS NVARCHAR(10)) END AS LatestReqDate
FROM #LineLevelInfo LLI
LEFT JOIN(SELECT * FROM cteLatestReqDate WHERE RowNum = 1)LRD ON LRD.OrderNo = LLI.OrderNo And LRD.GroupLineNum = LLI.LineNum

In Redshift, how do I run the opposite of a SUM function

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift which basically assigns session numbers to each users session. It starts at 1 for the first record for each user and as you move further down, if it encounters a true in the "is_new_session" column, it increments. Stays the same if it encounters a false. If it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, you might need the windowing clause:
select t.*,
sum(case when is_new_session then 1 else 0 end) over
(partition by user_id
order by date
rows between unbounded preceding and current row
) as session_number
from t;

Select dynamic couples of lines in SQL (PostgreSQL)

My objective is to make dynamic group of lines (of product by TYPE & COLOR in fact)
I don't know if it's possible just with one select query.
But : I want to create group of lines (A PRODUCT is a TYPE and a COLOR) as per the number_per_group column and I want to do this grouping depending on the date order (Order By DATE)
A single product with a NB_PER_GROUP number 2 is exclude from the final result.
Table :
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results :
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) group_number, ip.num
from products ip
join (
select date, type, color, row_number() over (order by date) - 1 group_number
from (
select op.num, op.type, op.color, op.nb_per_group, op.date, (row_number() over (partition by op.type, op.color order by op.date) - 1) % nb_per_group group_order
from products op
) sq
where sq.group_order = 0
) gn
on ip.type = gn.type
and ip.color = gn.color
and ip.date >= gn.date
group by ip.num
order by group_number, ip.num
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required.
The innermost subquery partitions the rows by type and color, orders them by date, then calculates the row numbers modulo nb_per_group; this forms a 0-based count for the group that resets to 0 each time nb_per_group is exceeded.
The next-level subquery finds all of the 0 values we mapped in the lower subquery and assigns group numbers to them.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number that split off before this product's date.