SQL Query - Combine rows based on multiple columns

On the image above, I'd like to combine rows with the same value on consecutive days.
Combined rows will have the earliest date on From column and the latest date on To column.
Looking at the example, even if Rows 3 and 4 have the same value, they were not combined because of the date gap.
I've tried using LAG and LEAD functions but no luck.

You can try the following approach:
with c as
(
    select *,
           datediff(dd, todate, leadval) as leaddiff,
           datediff(dd, todate, lagval) as lagdiff
    from
    (
        select *,
               lead(todate) over (partition by value order by todate) as leadval,
               lag(todate) over (partition by value order by todate) as lagval
        from t1
    ) A
)
select * from
(
    select value, min(todate) as fromdate, max(todate) as todate
    from c
    where coalesce(leaddiff, 0) + coalesce(lagdiff, 0) in (1, -1)
    group by value
    union all
    select value, fromdate, todate
    from c
    where coalesce(leaddiff, 0) + coalesce(lagdiff, 0) > 1
       or coalesce(leaddiff, 0) + coalesce(lagdiff, 0) < -1
) A
order by value
OUTPUT:
value fromdate todate
1 16/07/2019 00:00:00 17/07/2019 00:00:00
3 21/07/2019 00:00:00 26/07/2019 00:00:00
2 18/07/2019 00:00:00 18/07/2019 00:00:00
2 20/07/2019 00:00:00 20/07/2019 00:00:00

I am going to recommend the following approach:
Find where each new group begins. You can do this by comparing the previous maximum todate with the fromdate in this row.
Do a cumulative sum of the starts to define a group.
Aggregate the results.
This can be handled using window functions and aggregation:
select value, min(fromdate) as fromdate, max(todate) as todate
from (select t.*,
             sum(case when prev_todate >= dateadd(day, -1, fromdate)
                      then 0   -- overlap, so this does not begin a new group
                      else 1   -- no overlap, so this does begin a new group
                 end) over
                 (partition by value order by fromdate) as grp
      from (select t.*,
                   max(todate) over (partition by value
                                     order by fromdate
                                     rows between unbounded preceding and 1 preceding
                                    ) as prev_todate
            from t
           ) t
     ) t
group by value, grp
order by value, min(fromdate);
Here is a db<>fiddle.
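To check the logic outside the database, the three steps above can be sketched procedurally. This is only a minimal Python sketch, not the answer's own method: the function name merge_consecutive and the sample rows are assumptions made for illustration, and the cumulative sum and the aggregation are folded into a single pass.

```python
from datetime import date, timedelta
from itertools import groupby


def merge_consecutive(rows):
    """Merge (value, fromdate, todate) rows that touch or overlap per value.

    Hypothetical helper: a row joins the current group when the running
    maximum todate reaches at least the day before its fromdate.
    """
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for value, group in groupby(ordered, key=lambda r: r[0]):
        prev_todate = None  # running max of todate over earlier rows of this value
        for _, fromdate, todate in group:
            starts_new = (
                prev_todate is None
                or prev_todate < fromdate - timedelta(days=1)
            )
            if starts_new:
                merged.append([value, fromdate, todate])
            else:
                merged[-1][2] = max(merged[-1][2], todate)
            prev_todate = todate if prev_todate is None else max(prev_todate, todate)
    return [tuple(m) for m in merged]


# Assumed sample data: value 1 on two consecutive days, value 2 with a one-day gap.
rows = [
    (1, date(2019, 7, 16), date(2019, 7, 16)),
    (1, date(2019, 7, 17), date(2019, 7, 17)),
    (2, date(2019, 7, 18), date(2019, 7, 18)),
    (2, date(2019, 7, 20), date(2019, 7, 20)),
]
result = merge_consecutive(rows)
for row in result:
    print(row)
```

With these assumed rows, the two rows for value 1 collapse into one range, while the two rows for value 2 stay separate because of the gap on 19 July.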

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to merge users who have overlapping voucher periods into a single row, showing the minimum start date and the maximum end date.
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that, since rows 1 and 2 have overlapping dates, I would like to combine them and take min(Start_date) and max(End_Date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.

This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date among the rows before the current row.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
             sum(case when prev_end_date >= start_date then 0 else 1
                 end) over (partition by id
                            order by start_date, voucher_code
                            rows between unbounded preceding and current row
                           ) as grp
      from (select u.*,
                   max(end_date) over (partition by id
                                       order by start_date, voucher_code
                                       rows between unbounded preceding and 1 preceding
                                      ) as prev_end_date
            from users u
           ) u
     ) u
group by id, grp;
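It can help to see the intermediate columns that the two nested subqueries produce. The following Python sketch mirrors them for a single id: prev_end (the running max over earlier rows), the start flags, and grp (their cumulative sum). The function names and the sample periods are hypothetical, and the voucher_code tie-breaker is left out for brevity.

```python
from datetime import date
from itertools import accumulate


def label_groups(periods):
    """Return (start, end, grp) for the periods of ONE id, ordered by start."""
    ordered = sorted(periods)
    ends = [end for _, end in ordered]
    # Running max of end_date over the rows strictly before the current one;
    # None for the first row, like the NULL a window function would return.
    prev_end = [None] + list(accumulate(ends, max))[:-1]
    # 1 where a new period starts, 0 where the row overlaps an earlier one.
    starts = [
        0 if p is not None and p >= start else 1
        for (start, _), p in zip(ordered, prev_end)
    ]
    # The cumulative sum of the start flags identifies the group.
    grp = list(accumulate(starts))
    return [(s, e, g) for (s, e), g in zip(ordered, grp)]


def merge_groups(labelled):
    """Aggregate labelled rows to (min start, max end) per group."""
    out = {}
    for start, end, g in labelled:
        lo, hi = out.get(g, (start, end))
        out[g] = (min(lo, start), max(hi, end))
    return [out[g] for g in sorted(out)]


# Assumed sample periods; the second is nested inside the first.
periods = [
    (date(2020, 1, 1), date(2020, 1, 10)),
    (date(2020, 1, 5), date(2020, 1, 7)),
    (date(2020, 1, 9), date(2020, 1, 15)),
    (date(2020, 2, 1), date(2020, 2, 5)),
]
labelled = label_groups(periods)
merged = merge_groups(labelled)
print([g for _, _, g in labelled])
print(merged)
```

The nested second period is why a running max is used rather than only the previous row's end_date: the third period still overlaps the first even though it does not overlap the second.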

Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date AS start_date, MAX(end_date) AS end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date

Oracle SQL LAG() function results in duplicate rows

I have a very simple query that results in two rows:
SELECT DISTINCT
id,
trunc(start_date) start_date
FROM example.table
WHERE ID = 1
This results in the following rows:
id start_date
1 7/1/2012
1 9/1/2016
I want to add a column that simply shows the previous date for each row. So I'm using the following:
SELECT DISTINCT id,
Trunc(start_date) start_date,
Lag(start_date, 1)
over (
ORDER BY start_date) pdate
FROM example.table
WHERE id = 1
However, when I do this, I get four rows instead of two:
id start_date pdate
1 7/1/2012 NULL
1 7/1/2012 7/1/2012
1 9/1/2016 7/1/2012
1 9/1/2016 9/1/2012
If I change the offset to 2 or 3 the results remain the same. If I change the offset to 0, I get two rows again but of course now the start_date == pdate.
I can't figure out what's going on

Use an explicit GROUP BY instead:
SELECT id, trunc(start_date) as start_date,
LAG(trunc(start_date)) OVER (PARTITION BY id ORDER BY trunc(start_date))
FROM example.table
WHERE ID = 1
GROUP BY id, trunc(start_date)

The reason is the order of evaluation in a SQL statement: LAG is computed before DISTINCT is applied.
You actually want LAG to run after the DISTINCT, so the query should be:
WITH t1 AS (
SELECT DISTINCT id, trunc(start_date) start_date
FROM example.table
WHERE ID = 1
)
SELECT t1.*, LAG(start_date, 1) OVER (ORDER BY start_date) pdate
FROM t1
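The effect of the evaluation order can be reproduced in a few lines of Python. This is a sketch under an assumption: the underlying table is taken to hold two rows per calendar day that differ only in their time part, which would explain the four rows in the question. The timestamps and the lag helper are made up for illustration.

```python
from datetime import datetime

# Hypothetical raw rows for id = 1: two timestamps on each of two calendar days.
raw = sorted([
    datetime(2012, 7, 1, 8, 0),
    datetime(2012, 7, 1, 17, 0),
    datetime(2016, 9, 1, 9, 0),
    datetime(2016, 9, 1, 18, 0),
])


def lag(values):
    """Previous element for each position, None for the first (like LAG(x, 1))."""
    return [None] + values[:-1]


# LAG on the raw timestamps first, then DISTINCT over (trunc(start_date), pdate):
# every pair is still unique, so nothing is removed.
lag_then_distinct = set(zip((d.date() for d in raw), lag(raw)))

# DISTINCT on the truncated dates first, then LAG over the deduplicated rows.
days = sorted({d.date() for d in raw})
distinct_then_lag = list(zip(days, lag(days)))

print(len(lag_then_distinct))
print(distinct_then_lag)
```

Deduplicating first leaves two rows, and the second row's previous date is the first row's date, which is the result the question asks for.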

Fill missing gaps in data using a date column

I have a temp table that returns this output
PRICE | DATE
1.491500 | 2019-02-01
1.494000 | 2019-02-04
1.486500 | 2019-02-06
I want to fill in the gaps in the data by duplicating the last known record before each gap, based on the date. Is there a way to update the existing temp table, or create a new temp table, with this desired output dynamically:
PRICE | DATE
1.491500 | 2019-02-01
1.491500 | 2019-02-02
1.491500 | 2019-02-03
1.494000 | 2019-02-04
1.494000 | 2019-02-05
1.486500 | 2019-02-06
I am working on SQL Server 2008 R2.

Because SQL Server does not support IGNORE NULLS in LAG(), this is a bit tricky. I would go for a recursive CTE of the form:
with cte as (
select price, date, dateadd(day, -1, lead(date) over (order by date)) as last_date
from t
union all
select price, dateadd(day, 1, date), last_date
from cte
where date < last_date
)
select price, date
from cte
order by date;
Here is a db<>fiddle.
lead() is not available in SQL Server 2008, so there you can replace it with a correlated subquery:
with cte as (
select price, date,
dateadd(day, -1,
(select min(t2.date)
from t t2
where t2.date > t.date
)) as last_date -- the day before the next known row, as in the lead() version
from t
union all
select price, dateadd(day, 1, date), last_date
from cte
where date < last_date
)
select price, date
from cte
order by date;
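The expansion that the recursive CTE performs can be written as a plain loop, which makes the role of last_date easier to see. This is a minimal Python sketch using the three rows from the question; the function name fill_gaps is an assumption made for illustration.

```python
from datetime import date, timedelta


def fill_gaps(rows):
    """Repeat each (price, day) row for every day up to the day before the next row."""
    ordered = sorted(rows, key=lambda r: r[1])
    filled = []
    for i, (price, day) in enumerate(ordered):
        if i + 1 < len(ordered):
            # last_date in the SQL: the day before the next known row.
            last_day = ordered[i + 1][1] - timedelta(days=1)
        else:
            # The final row has no successor, so it is emitted once.
            last_day = day
        current = day
        while current <= last_day:
            filled.append((price, current))
            current += timedelta(days=1)
    return filled


rows = [
    (1.4915, date(2019, 2, 1)),
    (1.4940, date(2019, 2, 4)),
    (1.4865, date(2019, 2, 6)),
]
filled = fill_gaps(rows)
for price, day in filled:
    print(f"{price:.6f} | {day}")
```

Each known row is carried forward until the day before the next known row, so no date is produced twice.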

Assuming there is a dates table (if not you can easily make one), you can do this by left joining the existing table to the dates table. Thereafter assign groups per dates found using a running sum. The max value per group is what would be needed to fill in the missing values.
select dt,max(price) over(partition by grp) as price
from (select p.price,d.dt,sum(case when p.dt is null then 0 else 1 end) over(order by d.dt) as grp
from dates d
left join prices p on p.dt = d.dt
) t
Making a dates table with a recursive cte. Persist it as needed.
--Generate dates in 2019
with dates(dt) as (select cast('2019-01-01' as date)
union all
select dateadd(day,1,dt)
from dates
where dt < '2019-12-31'
)
select * from dates
option(maxrecursion 0)

Cumulative distinct count

I'm having trouble getting a cumulative distinct count so let's just assume the below dataset.
DATE RID
1/1/18 1
1/1/18 2
1/1/18 3
1/1/18 3
So if we run this query
SELECT DATE, COUNT(DISTINCT RID) FROM TABLE GROUP BY DATE;
we would expect it to return 3, however let's assume that the data for the next day is as follows.
DATE RID
1/2/18 1
1/2/18 6
1/2/18 9
How would you write a query to get the following results where the data for 1/1/18 is considered when returning the distinct for 1/2/18.
So it would be the following results.
Date Count(*)
1/1/18 3
1/2/18 5 <- 1/1/18 distinct plus + 1/2 distinct.
Hope that makes sense, keep in mind this is a very large dataset if that changes things.

You can do a cumulative count of the earliest date for each rid:
select mindate, count(*), sum(count(*)) over (order by mindate)
from (select rid, min(date) as mindate
from t
group by rid
) t
group by mindate
order by mindate;
Note: This will omit any date that is not the earliest date of at least one rid. Here is one way to get all the dates, if that is an issue:
select mindate, count(rid), sum(count(rid)) over (order by mindate)
from ((select rid, min(date) as mindate
from t
group by rid
)
union all
(select distinct NULL, date
from t
)
) rd
group by mindate
order by mindate;
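The idea behind both queries is that each rid is counted once, on the first date it appears, and those first appearances are then summed cumulatively. A minimal Python sketch of that idea, using the sample data from the question (the function name cumulative_distinct is hypothetical):

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_distinct(rows):
    """rows: (day, rid) pairs; returns [(day, distinct rids seen up to that day)]."""
    rows = list(rows)
    # Earliest date per rid (the inner "min(date) ... group by rid").
    first_seen = {}
    for day, rid in rows:
        if rid not in first_seen or day < first_seen[rid]:
            first_seen[rid] = day
    # Number of rids whose first appearance falls on each date.
    new_per_day = Counter(first_seen.values())
    # Keep every date in the data, including dates with no new rid.
    days = sorted({day for day, _ in rows})
    totals = accumulate(new_per_day[d] for d in days)
    return list(zip(days, totals))


rows = [
    (date(2018, 1, 1), 1),
    (date(2018, 1, 1), 2),
    (date(2018, 1, 1), 3),
    (date(2018, 1, 1), 3),
    (date(2018, 1, 2), 1),
    (date(2018, 1, 2), 6),
    (date(2018, 1, 2), 9),
]
result = cumulative_distinct(rows)
print(result)
```

Because each rid contributes exactly once, the work after the first grouping is proportional to the number of distinct dates, which matters for a very large dataset.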

The query below gives the required cumulative distinct count.
--Step 3:
SELECT dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT rid,
dt,
COUNT(CASE WHEN row_num = 1 THEN rid END) OVER (ORDER BY dt ROWS BETWEEN Unbounded PRECEDING AND CURRENT ROW) cum_distinct_cnt
FROM (
--Step 1:
SELECT rid,
dt,
ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
FROM table) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cum_distinct_cnt DESC) = 1
Since your dataset is very large, you can break the above query into the steps marked in its comments and create work tables for innerTab1 and innerTab2 to get the final output.

SQL: Select Multiple Columns with Max() on calculated values

Real basic: I have table T with following data:
ID StartDate Term (months)
----------------------
1 10/1/2012 12
2 10/1/2012 24
3 12/1/2012 12
I need to know the ID of the row that has the max end date. I've successfully calculated the end date as
select max(DateAdd(month, term, StartDate)) from table [this would result in 10/1/2014]
How do I get the ID value and StartDate of the row that contains the max end date?

MS SQL:
SELECT TOP 1 ID, StartDate
FROM T
ORDER BY DateAdd(month, term, StartDate) DESC
MySQL:
SELECT ID, StartDate
FROM T
ORDER BY DATE_ADD(StartDate, INTERVAL term MONTH) DESC
LIMIT 1

In case more than one ID has the same extreme "end date" and you need them all, you can try this:
SELECT x.id
FROM (
SELECT id
, RANK ( ) OVER ( ORDER BY DateAdd(month, term, StartDate) DESC) as rn
FROM T
) x
WHERE x.rn = 1
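The difference between TOP 1 / LIMIT 1 and RANK shows up only when several rows share the maximum end date. The following Python sketch illustrates the tie case; the add_months helper, the function name, and the fourth row (ID 4) are assumptions added for illustration, while the first three rows come from the question.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add whole months to a date, clamping the day to the end of the month."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def rows_with_max_end_date(rows):
    """rows: (id, start_date, term_months); returns every row tied for the max end date."""
    if not rows:
        return []
    ends = [add_months(start, term) for _, start, term in rows]
    latest = max(ends)
    return [row for row, end in zip(rows, ends) if end == latest]


rows = [
    (1, date(2012, 10, 1), 12),
    (2, date(2012, 10, 1), 24),
    (3, date(2012, 12, 1), 12),
    (4, date(2013, 10, 1), 12),  # hypothetical extra row that ties with ID 2
]
winners = rows_with_max_end_date(rows)
print([(row_id, start) for row_id, start, _ in winners])
```

With the hypothetical fourth row, IDs 2 and 4 both end on 10/1/2014, so a rank-based filter returns both, whereas TOP 1 or LIMIT 1 would return only one of them.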
