Min and max value per group keeping order - sql

I have a small problem in Redshift with grouping. I have a table like the following:
INPUT
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-11'
1 '2020-09-11' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
I want to obtain this output:
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
A simple MIN and MAX of the dates grouped by value doesn't work, because it merges the two separate runs of value 1 into a single row.

This is an example of a gap-and-islands problem. If there are no time gaps in the data, then a difference of row numbers is a simple solution:
select value, min(created), max(updated)
from (select t.*,
             row_number() over (order by created) as seqnum,
             row_number() over (partition by value order by created) as seqnum_2
      from t
     ) t
group by value, (seqnum - seqnum_2)
order by min(created);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will see how the difference between the row numbers identifies adjacent rows with the same value.
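To see why the difference works, it can help to print the two row numbers side by side. The sketch below is a plain-Python trace of the subquery on the question's sample rows, not Redshift code; the helper names `row_number_diff` and `islands` are made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (value, created, updated)
rows = [
    (1, "2020-09-10", "2020-09-11"),
    (1, "2020-09-11", "2020-09-13"),
    (2, "2020-09-15", "2020-09-16"),
    (1, "2020-09-17", "2020-09-18"),
]

def row_number_diff(rows):
    """Return (value, created, updated, seqnum, seqnum_2, diff) per row."""
    per_value = defaultdict(int)  # running row number within each value
    out = []
    ordered = sorted(rows, key=lambda r: r[1])  # order by created
    for seqnum, (value, created, updated) in enumerate(ordered, start=1):
        per_value[value] += 1
        seqnum_2 = per_value[value]
        out.append((value, created, updated, seqnum, seqnum_2, seqnum - seqnum_2))
    return out

def islands(rows):
    """Group by (value, diff) and take min(created), max(updated)."""
    groups = {}
    for value, created, updated, _, _, diff in row_number_diff(rows):
        lo, hi = groups.get((value, diff), (created, updated))
        groups[(value, diff)] = (min(lo, created), max(hi, updated))
    result = [(value, lo, hi) for (value, _), (lo, hi) in groups.items()]
    return sorted(result, key=lambda r: r[1])  # order by min(created)

for r in row_number_diff(rows):
    print(r)
print(islands(rows))
```

Within a run of the same value both counters advance together, so the difference stays constant. When a different value intervenes, only seqnum advances, so the next run of that value gets a new difference.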

Related

BigQuery Standard SQL - Cumulative Count of (almost) Duplicated Rows

With the following data:
id  field  eventTime
--------------------
1   A      1
1   A      2
1   B      3
1   A      4
1   B      5
1   B      6
1   B      7
For visualisation purposes, I would like to turn it into the below. Consecutive occurrences of the same field value essentially get aggregated into one.
id  field  eventTime
--------------------
1   Ax2    1
1   B      3
1   A      4
1   Bx3    5
I will then use STRING_AGG() to turn it into "Ax2 > B > A > Bx3".
I've tried using ROW_NUMBER() to count the repeated instances, with the plan being to utilise the highest row number to modify the string in field, but if I partition on eventTime, there are no consecutive "duplicates", and if I don't partition on it then all rows with the same field value are counted - not just consecutive ones.
I thought about bringing in the previous field with LAG() for a comparison to reset the row count, but that only works for transitions from one field value to the other and is a problem if the same field is repeated consecutively.
I've been struggling with this to the point where I'm considering writing a script that just CASE WHENs up to a reasonable number of consecutive hits, but I've seen it get as high as 17 on a given day and really don't want to be doing that!
My other alternative will just be to enforce a maximum number of field values to help control this, but now I've started this problem I'd quite like to solve it without that, if at all possible.
Thanks!
Consider below
select id,
  any_value(field) || if(count(1) = 1, '', 'x' || count(1)) field,
  min(eventTime) eventTime
from (
  select id, field, eventTime,
    countif(ifnull(flag, true)) over(partition by id order by eventTime) grp
  from (
    select id, field, eventTime,
      field != lag(field) over(partition by id order by eventTime) flag
    from `project.dataset.table`
  )
)
group by id, grp
# order by eventTime
If applied to the sample data in your question, the output matches the desired result shown there.
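The flag-and-running-count logic of this query can also be traced step by step outside the database. What follows is a minimal Python sketch on the question's sample rows, not BigQuery code; `collapse_runs` is a hypothetical helper name, and it assumes field is never NULL.

```python
from itertools import groupby

# Sample rows from the question: (id, field, eventTime)
events = [(1, "A", 1), (1, "A", 2), (1, "B", 3), (1, "A", 4),
          (1, "B", 5), (1, "B", 6), (1, "B", 7)]

def collapse_runs(events):
    """Flag changes, keep a running count of flags, then aggregate per group."""
    out = []
    ordered = sorted(events, key=lambda e: (e[0], e[2]))  # partition by id, order by eventTime
    for id_, id_rows in groupby(ordered, key=lambda e: e[0]):
        prev_field, grp, runs = None, 0, {}
        for _, field, event_time in id_rows:
            # The flag is true on the first row of an id or when field changes
            if prev_field is None or field != prev_field:
                grp += 1
            runs.setdefault(grp, []).append((field, event_time))
            prev_field = field
        for members in runs.values():
            field, n = members[0][0], len(members)
            label = field if n == 1 else f"{field}x{n}"
            out.append((id_, label, min(t for _, t in members)))
    return out

collapsed = collapse_runs(events)
print(collapsed)
print(" > ".join(label for _, label, _ in collapsed))
```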
Just use lag() to detect when the value of field changes. You can now do that with qualify:
select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field;
For your final step, you can use a subquery:
select id, string_agg(field, '->' order by eventtime)
from (select t.*
      from t
      where 1=1
      qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
     ) t
group by id;
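As a rough illustration of what the qualify filter keeps, here is a small Python sketch on the question's sample rows (the helper names are hypothetical). It keeps the first row of each run and joins the fields per id, so on its own it yields the path without the "x2"-style counts.

```python
# Sample rows from the question: (id, field, eventTime)
events = [(1, "A", 1), (1, "A", 2), (1, "B", 3), (1, "A", 4),
          (1, "B", 5), (1, "B", 6), (1, "B", 7)]

def first_of_each_run(events):
    """Keep rows whose field differs from the previous row of the same id."""
    kept, prev = [], {}
    for id_, field, event_time in sorted(events, key=lambda e: (e[0], e[2])):
        # A missing previous row plays the role of lag's default ''
        if prev.get(id_, "") != field:
            kept.append((id_, field, event_time))
        prev[id_] = field
    return kept

def path_per_id(events):
    """Join the surviving fields per id, in eventTime order."""
    paths = {}
    for id_, field, _ in first_of_each_run(events):
        paths.setdefault(id_, []).append(field)
    return {id_: "->".join(fields) for id_, fields in paths.items()}

print(first_of_each_run(events))
print(path_per_id(events))
```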

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
rank #1 is assigned to the ID with the lowest date. However, the subsequent dates for that particular ID can be higher but they will get the incremental rank w.r.t other IDs.
(E.g. ADF32 series will be considered to be ranked first as it had the lowest date, although it ends with dates 09-Nov, and RT659 starts with 13-Aug it will be ranked subsequently)
For a particular ID, if the days are consecutive then ranks are same, else they add by 1.
For a particular ID, ranks are given in date ASC.
How to formulate a query?
You need two steps:
select
   id_col
  ,dt_col
  ,dense_rank()
   over (order by min_dt, id_col, dt_col - rnk) as part_col
from
 (
   select
      id_col
     ,dt_col
     ,min(dt_col)
      over (partition by id_col) as min_dt
     ,rank()
      over (partition by id_col
            order by dt_col) as rnk
   from tab
 ) as dt
dt_col - rnk calculates the same result for consecutive dates -> same rank
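A quick way to convince yourself of the date-minus-rank trick is to compute it by hand for one ID. The dates below are made up for illustration, and the sketch assumes each date appears only once per ID (so rank() behaves like a plain row count).

```python
from datetime import date, timedelta

def date_minus_rank(dates):
    """Pair each date with (date - rank), ranking in ascending date order."""
    ordered = sorted(dates)
    return [(d, d - timedelta(days=rnk)) for rnk, d in enumerate(ordered, start=1)]

# Hypothetical dates for a single ID: one 3-day run, then one 2-day run
sample = [date(2020, 8, 1), date(2020, 8, 2), date(2020, 8, 3),
          date(2020, 8, 7), date(2020, 8, 8)]

for d, anchor in date_minus_rank(sample):
    print(d, anchor)
```

Every date in the first run maps to 2020-07-31 and every date in the second run maps to 2020-08-03, so the difference can serve as a group key.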
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL, t.dt_col,
       rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from (SELECT ID_COL, dt_col,
             DATEDIFF(day, Lag(dt_col, 1) OVER(ORDER BY dt_col), dt_col) as date_diff
      FROM table1
     ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
       sum(case when prev_dt_col = dt_col - 1 then 0 else 1
           end) over
           (order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
             lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
             min(dt_col) over (partition by id_col) as min_dt_col
      from t
     ) t;
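The "when to add 1" rule can be sketched as a simple loop. This is a Python illustration of the cumulative sum, not Teradata code; the IDs are borrowed from the question, but the dates (and the year) are invented for the example.

```python
from datetime import date, timedelta

def rank_islands(rows):
    """rows: (id_col, dt_col) pairs. Returns (id_col, dt_col, ranking) tuples."""
    # min(dt_col) over (partition by id_col)
    min_dt = {}
    for id_col, dt in rows:
        min_dt[id_col] = min(dt, min_dt.get(id_col, dt))
    # order by min_dt_col, id_col, dt_col
    ordered = sorted(rows, key=lambda r: (min_dt[r[0]], r[0], r[1]))
    out, ranking, prev = [], 0, {}
    for id_col, dt in ordered:
        # Add 1 unless the previous date of the same id is exactly one day earlier
        if prev.get(id_col) != dt - timedelta(days=1):
            ranking += 1
        prev[id_col] = dt
        out.append((id_col, dt, ranking))
    return out

# Hypothetical data: ADF32 starts earliest, so all of its rows rank before RT659
sample = [
    ("RT659", date(2020, 8, 13)), ("RT659", date(2020, 8, 14)), ("RT659", date(2020, 8, 20)),
    ("ADF32", date(2020, 8, 1)), ("ADF32", date(2020, 8, 2)), ("ADF32", date(2020, 11, 9)),
]
for row in rank_islands(sample):
    print(row)
```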

RANK in SQL but start at 1 again when number is greater than

I need SQL code for the below. I want to RANK the rows, but whenever DSLR >= 60 I want the rank to start again at 1, like below.
Thanks
Assuming that you have a column that defines the ordering of the rows, say id, you can address this as a gaps-and-islands problem. Islands are groups of adjacent records that start with a dslr of 60 or more. We can identify them with a window sum, then rank within each island:
select dslr, rank() over(partition by grp order by id) as rn
from (
    select t.*,
        sum(case when dslr >= 60 then 1 else 0 end) over(order by id) as grp
    from mytable t
) t
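For intuition, the window sum and the per-island rank can be mimicked with two counters. This is a small Python sketch with made-up dslr values, assuming the list is already in id order and ids are unique.

```python
def restart_rank(dslr_values):
    """Return (dslr, grp, rn) per row; rn restarts at 1 whenever dslr >= 60."""
    grp, rn, out = 0, 0, []
    for dslr in dslr_values:
        if dslr >= 60:
            grp += 1  # running sum of the "dslr >= 60" flags
            rn = 0    # a new island starts here
        rn += 1
        out.append((dslr, grp, rn))
    return out

# Hypothetical values, listed in id order
sample = [10, 20, 65, 5, 30, 90, 15]
for row in restart_rank(sample):
    print(row)
```

Rows before the first dslr >= 60 fall into group 0 and are ranked from 1 like any other island.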

Is there a way to split rows into groups based on certain values?

Consider this table:
I want to divide these rows into groups based on their id and price values: as long as two rows have the same id and price and are not divided by any other row they belong to the same group, so I expect the output to be sorta like this:
I tried using window functions but with them I ended up with the last row having the same group as the first 3. Is there something I'm missing?
This is a gaps-and-islands problem. One method is to use lag() to detect changes and then a cumulative sum:
select t.*,
       sum(case when prev_price = price then 0 else 1 end) over
           (partition by id order by dt) as group_num
from (select t.*,
             lag(price) over (partition by id order by dt) as prev_price
      from t
     ) t
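The sketch below mirrors the lag-plus-cumulative-sum idea in Python on invented rows (the question's table is not reproduced here, and the column name dt is taken from the query above). It shows that a price returning after an interruption gets a new group number rather than rejoining the first group.

```python
def group_numbers(rows):
    """rows: (id, dt, price) tuples. Returns (id, dt, price, group_num) tuples."""
    out, prev_price, group_num = [], {}, {}
    for id_, dt, price in sorted(rows, key=lambda r: (r[0], r[1])):
        # lag(price) is NULL on the first row of an id, so that row starts group 1
        if id_ not in prev_price or prev_price[id_] != price:
            group_num[id_] = group_num.get(id_, 0) + 1
        prev_price[id_] = price
        out.append((id_, dt, price, group_num[id_]))
    return out

# Hypothetical rows: id 1 returns to price 10 after one row at price 20
sample = [(1, 1, 10), (1, 2, 10), (1, 3, 10), (1, 4, 20), (1, 5, 10),
          (2, 1, 5), (2, 2, 5)]
for row in group_numbers(sample):
    print(row)
```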

SQL query for backfilling register read values

I have a table with ID, timestamp, and register reads for a day. The register reads are running totals that start at 12:00 midnight and end at 11:00 at night.
The problem is that the cumulative reads may be missing for some random time intervals, and I need to backfill those.
The picture below gives a snapshot of the problem. KWH_RDNG is the difference between two cumulative intervals divided by 1000, but the 5.851 in the 4th column is actually the accumulation of 3 missing hours along with the 4th hour's value. It's fine if I simply divide 5.851 by 4 and distribute it.
The challenge is that the gaps can happen at random intervals and can differ between meters (1st column). I am using SQL Server 2016.
Please help!
This is a gaps-and-islands problem, sort of. You need to group each run of NULL values with the subsequent non-NULL value. One method is to use a cumulative count of the non-NULL values on or after each row, which is why the window is ordered by read_utc descending. This defines the groups.
Then, you need the row count and the reading for each group. So, this should do the calculation:
select t.*,
       (max_kwh_rding / cnt) as new_kwh_rding
from (select t.*,
             count(*) over (partition by meter_serial, grp) as cnt,
             max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
      from (select t.*,
                   count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
            from t
           ) t
     ) t
where cnt > 1;
You can incorporate this into an update:
with toupdate as (
      select t.*,
             (max_kwh_rding / cnt) as new_kwh_rding
      from (select t.*,
                   count(*) over (partition by meter_serial, grp) as cnt,
                   max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
            from (select t.*,
                         count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
                  from t
                 ) t
           ) t
      where cnt > 1
     )
update toupdate
    set kwh_rding = new_kwh_rding;
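To make the grouping concrete, here is a minimal Python sketch of the same backfill for a single meter. Only the 5.851 figure comes from the question; the other readings and the hour labels are invented, and `backfill` is a hypothetical helper rather than anything SQL Server provides.

```python
def backfill(readings):
    """readings: (read_utc, kwh_rding or None) for one meter, oldest first.

    Each run of missing readings is grouped with the next non-missing
    reading, and that reading is split evenly across the whole group.
    """
    out = list(readings)
    pending = []  # indexes of missing rows waiting for the next reading
    for i, (_, kwh) in enumerate(readings):
        if kwh is None:
            pending.append(i)
            continue
        if pending:
            share = kwh / (len(pending) + 1)  # max_kwh_rding / cnt
            for j in pending + [i]:
                out[j] = (readings[j][0], share)
            pending = []
    return out  # trailing missing rows with no later reading stay None

# Hypothetical hourly readings; hours 1 to 3 are missing and hour 4 holds their total
sample = [(0, 1.2), (1, None), (2, None), (3, None), (4, 5.851), (5, 1.5)]
for row in backfill(sample):
    print(row)
```

Rows that are not next to a gap are left untouched, which corresponds to the `cnt > 1` filter in the query.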