I am working on migrating data from an old system to a new system.
I need to group the data by id and name.
The start date should be the minimum date and the end date should be the maximum date.
If rows for the same id and name combination overlap or fall within the same period, they should be merged into a single row that runs from the lowest start date to the highest end date, so that duplicates are avoided.
Legacy System
New System Expectation
ID - 139247 contains duplicate rows based on name.
I have added sample data at https://dbfiddle.uk/?rdbms=oracle_18&fiddle=8d6877847c5e052adf703430b5c7f083
Please let me know if more details are needed. Thanks in advance.
This is a type of gaps-and-islands problem. Because you want to merge any overlaps, I would use a cumulative max of the previous enddate to determine where the islands begin:
select id, name, min(startdate) as startdate,
       -- if any enddate in the island is null (still open), return null
       (case when count(enddate) = count(*) then max(enddate)
        end) as enddate
from (select t.*,
             -- cumulative count of island starts; the order by is what makes the sum cumulative
             sum(case when prev_enddate >= startdate then 0 else 1 end) over (partition by id, name order by startdate) as grp
      from (select t.*,
                   max(enddate) over (partition by id, name order by startdate range between unbounded preceding and interval '1' day preceding) as prev_enddate
            from t
           ) t
     ) t
group by id, name, grp
order by name, startdate;
Here is a db<>fiddle.
Related
I am using Redshift SQL and would like to merge rows for users who have overlapping voucher periods into a single row (showing the minimum start date and the maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift
The explanation is that since rows 1 and 2 have overlapping dates, I would like to combine them and take min(Start_date) and max(End_Date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date among the earlier rows for the same id.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
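To check the logic by hand, the four steps above can be sketched in plain Python. This is only an illustration of the same idea (a running maximum of end_date decides whether a row starts a new island), not part of the original answer. The function name `merge_periods` and the sample rows are made up for the example, and rows are assumed to be `(id, start_date, end_date)` tuples.

```python
from datetime import date
from itertools import groupby


def merge_periods(rows):
    """Merge overlapping (id, start_date, end_date) rows per id."""
    merged = []
    ordered = sorted(rows)  # by id, then start_date, then end_date
    for key, group in groupby(ordered, key=lambda r: r[0]):
        island_start = island_end = None
        for _, start, end in group:
            if island_end is not None and island_end >= start:
                # Overlaps the running max end_date: same island (grp unchanged).
                island_end = max(island_end, end)
            else:
                # No overlap: close the previous island and start a new one.
                if island_start is not None:
                    merged.append((key, island_start, island_end))
                island_start, island_end = start, end
        merged.append((key, island_start, island_end))
    return merged


# Hypothetical sample data, not taken from the question.
users = [
    (1, date(2020, 1, 1), date(2020, 1, 31)),
    (1, date(2020, 1, 15), date(2020, 2, 10)),
    (1, date(2020, 3, 1), date(2020, 3, 5)),
    (2, date(2020, 1, 1), date(2020, 12, 31)),
    (2, date(2020, 2, 1), date(2020, 2, 5)),
    (2, date(2020, 6, 1), date(2020, 6, 10)),
]
result = merge_periods(users)
```

The rows for id 2 show why a running max is used rather than only the previous row's end_date: the third row does not overlap the second, but it is still covered by the first.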
Another approach would be to use a recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date AS start_date, MAX(end_date) AS end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date;
I have a SQL table with the data shown in the picture.
I need to write a SQL query that counts, for each ticker, the number of consecutive days per year on which the close_value is greater than the open_value. If the close_value is less than the open_value, the counter must be reset to zero, and I have to save the value the counter had at that point.
This is an example of a gaps-and-islands problem. You can use the difference of row_numbers():
select ticker, min(date), max(date), min(open_value), max(close_value),
count(*) as num_rows
from (select t.*,
row_number() over (partition by ticker order by date) as seqnum,
row_number() over (partition by ticker, (case when close_value > open_value then 1 else 2 end) order by date) as seqnum_2
from t
) t
where close_value > open_value
group by ticker, (seqnum - seqnum_2);
This returns all such periods. You haven't specified what the result set should look like, but this should be pretty close.
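To see why the difference of the two row numbers identifies a streak, here is a small Python sketch of the same bookkeeping. It is an illustration only. The function name `winning_streaks` and the sample prices are invented, rows are assumed to be `(ticker, day, open_value, close_value)` tuples, and days are ISO date strings so that they sort chronologically.

```python
from collections import defaultdict


def winning_streaks(rows):
    """Return (ticker, first_day, last_day, length) for each run of up days."""
    seqnum = defaultdict(int)    # row_number() per ticker
    seqnum_2 = defaultdict(int)  # row_number() per (ticker, up or not)
    streaks = {}
    for ticker, day, open_value, close_value in sorted(rows):
        is_up = close_value > open_value
        seqnum[ticker] += 1
        seqnum_2[(ticker, is_up)] += 1
        if not is_up:
            continue  # mirrors the "where close_value > open_value" filter
        # Within one unbroken run of up days the difference stays constant.
        key = (ticker, seqnum[ticker] - seqnum_2[(ticker, is_up)])
        if key in streaks:
            first_day, _, length = streaks[key]
            streaks[key] = (first_day, day, length + 1)
        else:
            streaks[key] = (day, day, 1)
    return [(t, first, last, n) for (t, _), (first, last, n) in streaks.items()]


# Hypothetical sample data: two up days, one down day, three up days.
prices = [
    ("ABC", "2021-01-04", 10, 11),
    ("ABC", "2021-01-05", 11, 12),
    ("ABC", "2021-01-06", 12, 11),
    ("ABC", "2021-01-07", 11, 13),
    ("ABC", "2021-01-08", 13, 14),
    ("ABC", "2021-01-11", 14, 15),
]
streaks = winning_streaks(prices)
```

Each down day advances the first counter but not the second, so the difference changes exactly when a streak is interrupted.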
I have a table that contains the following columns: Date, Customer, Active Flag. I need to add a fourth column called Start. The Start column should return the first date the client was active, based on consecutive active flags.
The sample data shows the three columns I currently have and the results I wish to return for the Start column.
Your insight into what my SQL code should look like to achieve this would be appreciated. Thanks!!
You can do this without subqueries, if I assume one date per month per customer:
select t.*,
(case when activeflag = 1
then coalesce(max(case when activeflag = 0 then date end) over (partition by customer order by date) + interval '1 month',
min(case when activeflag = 1 then date end) over (partition by customer)
)
end) as start
from t;
Subqueries, though, might make this easier. You can treat this as a gaps-and-islands problem:
select t.*,
       (case when activeflag = 1
             then min(date) over (partition by customer, seqnum - seqnum_a)
        end) as start
from (select t.*,
             row_number() over (partition by customer order by date) as seqnum,
             row_number() over (partition by customer, activeflag order by date) as seqnum_a
      from t
     ) t;
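For comparison, the expected Start column can also be described as a single ordered pass over each customer's rows. The Python sketch below uses that simpler technique, not the row-number difference itself, and is meant only as a way to check results. The function name `add_start` and the sample rows are hypothetical, and rows are assumed to be `(customer, date, active_flag)` tuples with ISO date strings.

```python
def add_start(rows):
    """Append the first date of the current run of active rows (or None)."""
    out = []
    prev_customer = None
    run_start = None
    for customer, day, active_flag in sorted(rows):
        if customer != prev_customer:
            run_start = None  # a new customer starts with no open run
        if active_flag == 1:
            if run_start is None:
                run_start = day  # first active row of a new run
            start = run_start
        else:
            run_start = None  # an inactive row ends the current run
            start = None
        out.append((customer, day, active_flag, start))
        prev_customer = customer
    return out


# Hypothetical sample data, one row per month per customer.
flags = [
    ("Acme", "2020-01-01", 1),
    ("Acme", "2020-02-01", 1),
    ("Acme", "2020-03-01", 0),
    ("Acme", "2020-04-01", 1),
    ("Acme", "2020-05-01", 1),
    ("Beta", "2020-01-01", 0),
    ("Beta", "2020-02-01", 1),
]
with_start = add_start(flags)
```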
I have a requirement to get values from a table based on an offset conditions on a date column.
For example, for the table attached below, if there are dates within 15 days of each other based on the effectivedate column, I should return only the first one.
So my expected result would be as below:
Here, for policy A1234, it returns the 6/18/16 entry and skips the 6/12/16 entry, because the two dates are within 15 days of each other and I took the latest one from the list.
If you want to group rows together that are within 15 days of each other, then you have a variant of the gaps-and-islands problem. I would recommend lag() and cumulative sum for this version:
select polno, min(effectivedate), max(expirationdate)
from (select t.*,
             -- add 1 only when the previous row is NOT within 15 days, i.e. a new group starts
             sum(case when prev_ed >= dateadd(day, -15, effectivedate)
                      then 0 else 1
                 end) over (partition by polno order by effectivedate) as grp
      from (select t.*,
                   lag(expirationdate) over (partition by polno order by effectivedate) as prev_ed
            from t
           ) t
     ) t
group by polno, grp;
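The intent of the grouping is to start a new group whenever the previous row's expiration date is more than 15 days before the current effective date. A Python sketch of that rule is below, as an illustration only. The function name `group_close_rows`, the `max_gap_days` parameter, and the sample policies are assumptions made for the example, and rows are assumed to be `(polno, effectivedate, expirationdate)` tuples.

```python
from datetime import date, timedelta


def group_close_rows(rows, max_gap_days=15):
    """Collapse rows of one policy that follow the previous row closely."""
    result = []
    prev_polno = prev_exp = None
    for polno, eff, exp in sorted(rows):  # by polno, then effectivedate
        same_group = (
            polno == prev_polno
            and prev_exp is not None
            and prev_exp >= eff - timedelta(days=max_gap_days)
        )
        if same_group:
            # Extend the current group: keep min(effectivedate), take max(expirationdate).
            p, start, end = result[-1]
            result[-1] = (p, start, max(end, exp))
        else:
            result.append((polno, eff, exp))
        # Like lag(): only the immediately preceding row is compared.
        prev_polno, prev_exp = polno, exp
    return result


# Hypothetical sample data, not taken from the question.
policies = [
    ("P1", date(2016, 6, 12), date(2016, 6, 20)),
    ("P1", date(2016, 6, 28), date(2016, 7, 5)),
    ("P1", date(2016, 9, 1), date(2016, 9, 10)),
    ("P2", date(2016, 9, 5), date(2016, 9, 6)),
]
grouped = group_close_rows(policies)
```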
I have a list of values in a database. There are many redundancies and I want to get rid of them. As you can see in the list below, dates [10/1/2011 - 7/1/2011) have a value of 0. I can make that into one entry with a start date of 10/1/2011 and an end date of 6/1/2011 and a value of 0 and delete all the other rows. I can do that for all the other similar values as well.
Here is my problem. I did this by writing a query that groups these together and then takes the Min(start date) as the start date and the Max(end date) as the end date. Notice that I have two groups of 0 though. When I group this in the query, the start date is 10/1/2010 and the end date is 2/1/2013. This is a problem elsewhere in my code because whenever it looks for a value at 2/1/2012 it finds 0 but it should be finding .955186.
Any suggestions on how I can write a query to account for this problem?
This is a "gaps-and-islands" problem.
If I assume that the first column is sufficient for defining the groups, then you can use a difference of row_number()s:
select min(startdate), max(enddate), value
from (select t.*,
row_number() over (order by startdate) as seqnum,
row_number() over (partition by value order by startdate) as seqnum_v
from t
) t
group by (seqnum - seqnum_v), value;
It is a gaps-and-islands problem. You may use the following query (it uses SQL Server syntax, but it can easily be adapted).
select min(startdate) startDate, max(enddate) endDate, value
from
(
select *,
row_number() over (partition by value order by startDate) - (year(startDate) * 12) - month(startDate) grp
from data
) t
group by value, grp
order by startDate
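It uses just one row_number(), which may be better than two since the DBMS does not have to pass over the table twice to generate the sequences. Note that subtracting the month index only works if consecutive rows are exactly one month apart.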
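The single-sequence trick can be checked with a short Python sketch: a per-value row number minus the month index (year * 12 + month) stays constant while rows with the same value follow each other month by month. This is an illustration under that one-row-per-month assumption. The function name `collapse_monthly` and the sample rows are invented, and rows are assumed to be `(startdate, enddate, value)` tuples.

```python
from collections import defaultdict
from datetime import date


def collapse_monthly(rows):
    """Collapse consecutive monthly rows that share a value into one range."""
    counters = defaultdict(int)  # row_number() per value, ordered by startdate
    groups = {}
    for start, end, value in sorted(rows):
        counters[value] += 1
        month_index = start.year * 12 + start.month
        key = (value, counters[value] - month_index)  # constant within an island
        if key in groups:
            s, e = groups[key]
            groups[key] = (min(s, start), max(e, end))
        else:
            groups[key] = (start, end)
    return sorted((s, e, v) for (v, _), (s, e) in groups.items())


# Hypothetical sample data: value 0 appears in two separate stretches.
data = [
    (date(2010, 10, 1), date(2010, 11, 1), 0),
    (date(2010, 11, 1), date(2010, 12, 1), 0),
    (date(2010, 12, 1), date(2011, 1, 1), 0),
    (date(2011, 1, 1), date(2011, 2, 1), 0.5),
    (date(2011, 2, 1), date(2011, 3, 1), 0.5),
    (date(2011, 3, 1), date(2011, 4, 1), 0),
    (date(2011, 4, 1), date(2011, 5, 1), 0),
]
collapsed = collapse_monthly(data)
```

The two stretches of value 0 end up in different groups because the gap between them shifts the month index without advancing the row number.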