How can i group rows on sql base on condition - sql

I am using redshift sql and would like to group users who has overlapping voucher period into a single row instead (showing the minimum start date and max end date)
For E.g if i have these records,
I would like to achieve this result using redshift
Explanation is tat since row 1 and row 2 has overlapping dates, I would like to just combine them together and get the min(Start_date) and max(End_Date)
I do not really know where to start. Tried using row_number to partition them but does not seem to work well. This is what I tried.
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.

This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overall (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;

Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date FROM Recursion group by id, group_start_date ORDER BY id, group_start_date

Related

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is to clear up this table a little bit, grouping some consequent rows together.
Is there any form to achieve this kind of result?
The first table is already working fine, I just want to get rid of some rows to free some disk space.
One method is to peak at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_to), max(valid_from)
from (select t.*,
sum(case when prev_valid_to >= valid_from + interval '-1 day' then 0 else 1 end) over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
from (select t.*,
lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
from t
) t
) t
group by id, class, grp;
If the are not dates, then this gets trickier. You could convert to dates. Or, you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)

MariaDB get first and last record of the month - nested query

I am using MariaDB and I have these kind of data:
I have also data for March and I am using this query to select distinct Months from the database:
select distinct(DATE_FORMAT(DT,'%m-%Y')) AS singleMonth FROM myTable
I want to be able to select FIRST and LAST record of P2 column for every month. How it is possible using the query above for getting all distinct months and also getting first record for the month and last?
Example what the query should return look-like:
You can use window functions and conditional aggregation:
select year(dt), month(dt),
min(case when seqnumn_asc = 1 then p2 end) as first_p2,
min(case when seqnumn_desc = 1 then p2 end) as last_p2
from (select t.*,
row_number() over (partition by year(dt), month(dt) order by dt asc) as seqnum_asc,
row_number() over (partition by year(dt), month(dt) order by dt desc) as seqnum_desc
from t
) t
group by year(dt), month(dt);

Rank the closest date, but not if date has already been used in previous row

Complicated to summarize the issue in the title, and in text so check the IMG for visual explanation.
I've got an issue joining two tables. The date from the first table (startdatetable) should get the next/closest date in the other table (enddatetable). This is to 99% easy done with rank, because most rows have a startdate that can find the next enddate, and before another enddate is available there is a new startdate.
However, if there are two startdates and only one enddate available the startdates will join the same enddate.
What I'm trying to do is, that if a date has been used in the row before, it should not be used in the next row.
The rows I want are highlighted.
The SQL I started out with looked like
select *
from (
select rank() over (partition by tid order by enddate asc) as rnk
, id, startdate, enddate
from startdateTable
inner join enddateTable on startdateTable.ID = enddateTable.id
and enddateTable.enddate > startdateTable.startdate
) q
where q.rnk = 1
This gets me the following result. The last row should instead get the 2100 date, since the 2020-06 date has been used in the previous row.
If you have two tables and you want to align them, then you can use row_number():
select s.id, s.startdate, e.enddate
from (select s.*, row_number() over (partition by id order by startdate) as seqnum
from startdateTable
) s join
(select e.*, row_number() over (partition by id order by enddate) as seqnum
from enddateTable
) e
on e.id = s.id and e.seqnum = s.seqnum

ORACLE SQL: Find last minimum and maximum consecutive period

I have the sample data set below which list the water meters not working for specific reason for a certain range period (jan 2016 to december 2018).
I would like to have a query that retrieves the last maximum and minimum consecutive period where the meter was not working within that range of period.
any help will be greatly appreciated.
You have two options:
select code, to_char(min_period, 'yyyymm') min_period, to_char(max_period, 'yyyymm') max_period
from (
select code, min(period) min_period, max(period) max_period,
max(min(period)) over (partition by code) max_min_period
from (
select code, period, sum(flag) over (partition by code order by period) grp
from (
select code, period,
case when add_months(period, -1)
= lag(period) over (partition by code order by period)
then 0 else 1 end flag
from (select mrdg_acc_code code, to_date(mrdg_per_period, 'yyyymm') period from t)))
group by code, grp)
where min_period = max_min_period
Explanation:
flag rows where period is not equal previous period plus one month,
create column grp which sums flags consecutively,
group data using code and grp additionaly finding maximal start of period,
show only rows where min_period = max_min_period
Second option is recursive CTE available in Oracle 11g and above:
with
data(period, code) as (
select to_date(mrdg_per_period, 'yyyymm'), mrdg_acc_code from t
where mrdg_per_period between 201601 and 201812),
cte (period, code) as (
select to_char(period, 'yyyymm'), code from data
where (period, code) in (select max(period), code from data group by code)
union all
select to_char(data.period, 'yyyymm'), cte.code
from cte
join data on data.code = cte.code
and data.period = add_months(to_date(cte.period, 'yyyymm'), -1))
select code, min(period) min_period, max(period) max_period
from cte group by code
Explanation:
subquery data filters only rows from 2016 - 2018 additionaly converting period to date format. We need this for function add_months to work.
cte is recursive. Anchor finds starting rows, these with maximum period for each code. After union all is recursive member, which looks for the row one month older than current. If it finds it then net row, if not then stop.
final select groups data. Notice that period which were not consecutive were rejected by cte.
Though recursive queries are slower than traditional ones, there can be scenarios where second solution is better.
Here is the dbfiddle demo for both queries. Good luck.
use aggregate function with group by
select max(mdrg_per_period) mdrg_per_period, mrdg_acc_code,max(mrdg_date_read),rea_Desc,min(mdrg_per_period) not_working_as_from
from tablename
group by mrdg_acc_code,rea_Desc
This is a bit tricky. This is a gap-and-islands problem. To get all continuous periods, it will help if you have an enumeration of months. So, convert the period to a number of months and then subtract a sequence generated using row_number(). The difference is constant for a group of adjacent months.
This looks like:
select acc_code, min(period), max(period)
from (select t.*,
row_number() over (partition by acc_code order by period_num) as seqnum
from (select t.*, floor(period / 100) * 12 + mod(period, 100) as period_num
from t
) t
where rea_desc = 'METER NOT WORKING'
) t
group by (period_num - seqnum);
Then, if you want the last one for each account, you can use a subquery:
select t.*
from (select acc_code, min(period), max(period),
row_number() over (partition by acc_code order by max(period desc) as seqnum
from (select t.*,
row_number() over (partition by acc_code order by period_num) as seqnum
from (select t.*, floor(period / 100) * 12 + mod(period, 100) as period_num
from t
) t
where rea_desc = 'METER NOT WORKING'
) t
group by (period_num - seqnum)
) t
where seqnum = 1;

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1#gmail.com was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
sum(count(*)) over (order by date),
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
Getting the total count and cumulative count is straight forward. To get the cumulative distinct count, use lag to check if the email had a row with a previous date, and set the flag to 0 so it would be ignored during a running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
DEMO HERE
Here is a solution that does not uses sum over, neither lag... And does produces the correct results.
Hence it could appear as simpler to read and to maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1