How to find the longest subsequence based on conditions in Impala SQL

I have a SQL table on Impala that contains the ID, dt (monthly basis, with no skipped months), and status of each person ID. I want to check how long each ID has been in each status (my expected answer is shown in the expected column).
I tried to solve this (shown in the value column) by using
count(status) over (partition by ID, status order by dt)
but it doesn't reset the count when the status changes.
+------+------------+--------+-------+----------+
| ID | dt | status | value | expected |
+------+------------+--------+-------+----------+
| 0001 | 01/01/2020 | 0 | 1 | 1 |
| 0001 | 01/02/2020 | 0 | 2 | 2 |
| 0001 | 01/03/2020 | 1 | 1 | 1 |
| 0001 | 01/04/2020 | 1 | 2 | 2 |
| 0001 | 01/05/2020 | 1 | 3 | 3 |
| 0001 | 01/06/2020 | 0 | 3 | 1 |
| 0001 | 01/07/2020 | 1 | 4 | 1 |
| 0001 | 01/08/2020 | 1 | 5 | 2 |
+------+------------+--------+-------+----------+
Is there any way to reset the counter when the status changes?

When you partition by ID and status, two groups are formed, for the values 0 and 1 of the status field. The months 1, 2, and 6 go into the first group (status 0) and the months 3, 4, 5, 7, and 8 go into the second group (status 1). The count function then counts rows within each of those groups, so the first group counts from 1 to 3 and the second from 1 to 5. This query does not account for changes in status; it simply divides the record set by the distinct status values.
One approach is to divide the records into blocks, where each status change starts a new block. The query below follows this approach and gives the expected result:
SELECT ID, dt, status,
       -- count rows within each block to get the running duration
       COUNT(status) OVER (PARTITION BY ID, block_number ORDER BY dt) AS value
FROM (
    SELECT ID, dt, status,
           -- a running total of the change flags numbers the blocks per ID
           SUM(change_in_status) OVER (PARTITION BY ID ORDER BY dt) AS block_number
    FROM (
        SELECT ID, dt, status,
               -- flag the rows where the status differs from the previous month
               CASE WHEN status <> LAG(status) OVER (PARTITION BY ID ORDER BY dt)
                      OR LAG(status) OVER (PARTITION BY ID ORDER BY dt) IS NULL
                    THEN 1
                    ELSE 0
               END AS change_in_status
        FROM statuses
    ) derive_status_changes
) derive_blocks;
Here is a working example in DB Fiddle.
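For what it's worth, the same blocks can also be derived with the classic difference-of-row-numbers trick, which avoids the LAG/CASE step. This is only a sketch against the same assumed statuses table, and it relies on dt sorting chronologically:
SELECT ID, dt, status,
       -- number the rows within each (ID, status, grp) island
       ROW_NUMBER() OVER (PARTITION BY ID, status, grp ORDER BY dt) AS value
FROM (
    SELECT ID, dt, status,
           -- the difference of these two row numbers is constant within a run
           -- of equal statuses and changes whenever the status flips
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY dt)
             - ROW_NUMBER() OVER (PARTITION BY ID, status ORDER BY dt) AS grp
    FROM statuses
) diffs;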

Related

BigQuery: delete duplicated rows that are not fully duplicated (delete the desired row)

I have a table recording customer steps on a daily basis. The table has Id, Date and Step columns. Some rows contain different steps on the same day for the same Id. A sample is shown below, on 5/3/2020 and 5/4/2020 for Id 1:
| Id | Date | Step |
|:-----|:---------|:-----|
| 1 | 5/1/2020 | 1 |
| 1 | 5/2/2020 | 1 |
| 1 | 5/3/2020 | 0 |
| 1 | 5/3/2020 | 5 |
| 1 | 5/4/2020 | 2 |
| 1 | 5/4/2020 | 10 |
| 1 | 5/5/2020 | 1 |
| 2 | 5/1/2020 | 1 |
| 2 | 5/2/2020 | 2 |
| 2 | 5/3/2020 | 0 |
I want to delete the rows that contain the lesser step: for Id 1, that is the 5/3/2020 row with step 0 and the 5/4/2020 row with step 2.
I tried using row_number() like this:
SELECT
    Id,
    Date,
    step,
    ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
    `dataset.step`
WHERE rn > 1
But that gives me the rows with the higher step, which is not what I want.
I was also able to select the rows with the lower step like this:
SELECT *
FROM `dataset.step` AS A
INNER JOIN `dataset.step` AS B
    ON A.Id = B.Id
    AND A.Date = B.Date
WHERE A.step < B.step
But I found no way to use it for a delete.
Use the below approach:
select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)
If applied to the sample data in your question, the output is:
| Id | Date     | Step |
|:---|:---------|:-----|
| 1  | 5/1/2020 | 1    |
| 1  | 5/2/2020 | 1    |
| 1  | 5/3/2020 | 5    |
| 1  | 5/4/2020 | 10   |
| 1  | 5/5/2020 | 1    |
| 2  | 5/1/2020 | 1    |
| 2  | 5/2/2020 | 2    |
| 2  | 5/3/2020 | 0    |
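Since the question ultimately asks for a delete rather than a select, one way to remove the lower-step rows directly is a correlated DELETE. This is only a hedged sketch, assuming the table really is `dataset.step` and that only the highest step per (Id, Date) should survive:
-- delete any row for which a higher step exists on the same Id and Date
DELETE FROM `dataset.step` t
WHERE EXISTS (
    SELECT 1
    FROM `dataset.step` s
    WHERE s.Id = t.Id
      AND s.Date = t.Date
      AND s.Step > t.Step
);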

SQL - group by a change of value in a given column

Apologies for the confusing title, I was unsure how to phrase it.
Below is my dataset:
+----+-----------------------------+--------+
| Id | Date | Amount |
+----+-----------------------------+--------+
| 1 | 2019-02-01 12:14:08.8056282 | 10 |
| 1 | 2019-02-04 15:23:21.3258719 | 10 |
| 1 | 2019-02-06 17:29:16.9267440 | 15 |
| 1 | 2019-02-08 14:18:14.9710497 | 10 |
+----+-----------------------------+--------+
It is an example of a bank trying to collect money from a debtor: first, collection of 10% of the owed sum is attempted; if the card is successfully charged, 15% is attempted; if that throws an error (for example, insufficient funds), 10% is attempted again.
The desired output would be:
+----+--------+---------+
| Id | Amount | Attempt |
+----+--------+---------+
| 1 | 10 | 1 |
| 1 | 15 | 2 |
| 1 | 10 | 3 |
+----+--------+---------+
I have tried:
SELECT Id, Amount
FROM table1
GROUP BY Id, Amount
I am struggling to create a new column based on when the value changes in the Amount column, as I assume that could be used as another grouping variable to fix this.
If you just want the rows where the value changes, use lag():
select t.id, t.amount,
       row_number() over (partition by id order by date) as attempt
from (select t.*,
             lag(amount) over (partition by id order by date) as prev_amount
      from table1 t
     ) t
where prev_amount is null or prev_amount <> amount;
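If instead you want the attempt number stamped on every row (including repeated rows within an attempt), a variant is to sum the change flags rather than filter on them. A hedged sketch against the same assumed table1:
select t.id, t.date, t.amount,
       -- a running total of the "amount changed" flags numbers the attempts
       sum(case when prev_amount is null or prev_amount <> amount
                then 1 else 0 end)
           over (partition by id order by date) as attempt
from (select t.*,
             lag(amount) over (partition by id order by date) as prev_amount
      from table1 t
     ) t;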

Rank rows in a column under conditions on a different column

I have the following dataset:
id | date | state
-----------------------
1 | 01/01/17 | high
1 | 02/01/17 | high
1 | 03/01/17 | high
1 | 04/01/17 | miss
1 | 05/01/17 | high
2 | 01/01/17 | miss
2 | 02/01/17 | high
2 | 03/01/17 | high
2 | 04/01/17 | miss
2 | 05/01/17 | miss
2 | 06/01/17 | high
I want to create a column rank_state which ranks, within groups of id, the entries by increasing date (starting from rank 0), counting only entries that do not have the state "miss". Furthermore, the rank repeats itself when the entry has a state of "miss". The output should look like:
id | date | state | rank_state
------------------------------------
1 | 01/01/17 | high | 0
1 | 02/01/17 | high | 1
1 | 03/01/17 | high | 2
1 | 04/01/17 | miss | 2
1 | 05/01/17 | high | 3
2 | 01/01/17 | miss | 0
2 | 02/01/17 | high | 0
2 | 03/01/17 | high | 1
2 | 04/01/17 | miss | 1
2 | 05/01/17 | miss | 1
2 | 06/01/17 | high | 2
For example, the 4th row has a rank of 2 since its state is "miss", i.e. it repeats the rank of row 3 (the same applies to rows 9 and 10). Please note that rows 6 and 7 should have rank 0.
I have tried the following:
,(case when state is not in ('miss') then (rank() over (partition by id order by date desc) - 1) end) as state_rank
and
,rank() over (partition by id order by case when state is not in ('miss') then date end) as state_rank
but neither gives me the desired result. Any ideas would be very helpful.
More than likely you want:
SELECT *,
       GREATEST(
           COUNT(case when state != 'miss' then 1 else null end)
               OVER (PARTITION BY id ORDER BY date) - 1,
           0
       ) as "state_rank"
FROM tbl;
SQL Fiddle
Basically:
- make your window frame (partition) over id
- only count the rows that aren't 'miss'
- because count - 1 could be negative at the start of a partition, you can slap on the GREATEST to floor it at 0 (preventing negatives); a quick walk-through follows
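As a concrete illustration of that last point (a hedged walk-through of the sample data): for id 2 the first row has state 'miss', so the running count of non-miss rows is 0 and count - 1 = -1; GREATEST(-1, 0) floors it to the expected rank of 0.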
Just add a frame clause to vol7ron's answer, since Redshift requires it:
select *,
       GREATEST(COUNT(case when state != 'miss' then 1 else null end)
                OVER (PARTITION BY id ORDER BY date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - 1,
                0) as state_rank
from tbl;

In Redshift, how do I run the opposite of a SUM function

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift that basically assigns session numbers to each user's sessions. It starts at 1 for the first record for each user; as you move further down, it increments whenever it encounters a true in the "is_new_session" column and stays the same when it encounters a false. If it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
       sum(case when is_new_session then 1 else 0 end) over
           (partition by user_id order by date) as session_number
from t;
In Redshift, you might need the windowing clause:
select t.*,
       sum(case when is_new_session then 1 else 0 end) over
           (partition by user_id
            order by date
            rows between unbounded preceding and current row
           ) as session_number
from t;

Select dynamic couples of lines in SQL (PostgreSQL)

My objective is to make dynamic groups of lines (of products, by TYPE and COLOR in fact).
I don't know if it's possible with just one select query.
But: I want to create groups of lines (a PRODUCT is a TYPE and a COLOR) according to the NB_PER_GROUP column, and I want to do this grouping in date order (ORDER BY DATE).
A product left on its own without a full group (for example a single leftover line when NB_PER_GROUP is 2) is excluded from the final result.
Table :
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results :
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) as group_number, ip.num
from products ip
join (
    select date, type, color,
           row_number() over (order by date) - 1 as group_number
    from (
        select op.num, op.type, op.color, op.nb_per_group, op.date,
               (row_number() over (partition by op.type, op.color order by op.date) - 1)
                   % op.nb_per_group as group_order
        from products op
    ) sq
    where sq.group_order = 0
) gn
    on ip.type = gn.type
    and ip.color = gn.color
    and ip.date >= gn.date
group by ip.num
order by group_number, ip.num;
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required.
The innermost subquery partitions the rows by type and color, orders them by date, then calculates the row numbers modulo nb_per_group; this forms a 0-based count for the group that resets to 0 each time nb_per_group is exceeded.
The next-level subquery finds all of the 0 values we mapped in the lower subquery and assigns group numbers to them.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number that split off before this product's date.
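To see that innermost step in isolation, here is a hedged snippet computing just group_order against the assumed products table. For TYPE 1 / COLOR 1 (NB_PER_GROUP = 2) the five matching rows get 0, 1, 0, 1, 0 in date order, so each 0 marks the first line of a potential group:
select num, type, color,
       -- 0 on the first line of each potential group, counting up to
       -- nb_per_group - 1 before resetting
       (row_number() over (partition by type, color order by date) - 1)
           % nb_per_group as group_order
from products;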