Oracle SQL LAG() function results in duplicate rows - sql

I have a very simple query that results in two rows:
SELECT DISTINCT
id,
trunc(start_date) start_date
FROM example.table
WHERE ID = 1
This results in the following rows:
id start_date
1 7/1/2012
1 9/1/2016
I want to add a column that simply shows the previous date for each row. So I'm using the following:
SELECT DISTINCT id,
Trunc(start_date) start_date,
Lag(start_date, 1)
over (
ORDER BY start_date) pdate
FROM example.table
WHERE id = 1
However, when I do this, I get four rows instead of two:
id start_date pdate
1 7/1/2012 NULL
1 7/1/2012 7/1/2012
1 9/1/2016 7/1/2012
1 9/1/2016 9/1/2012
If I change the offset to 2 or 3 the results remain the same. If I change the offset to 0, I get two rows again but of course now the start_date == pdate.
I can't figure out what's going on.

Use an explicit GROUP BY instead:
SELECT id, trunc(start_date) as start_date,
LAG(trunc(start_date)) OVER (PARTITION BY id ORDER BY trunc(start_date))
FROM example.table
WHERE ID = 1
GROUP BY id, trunc(start_date)

The reason is the order in which a SQL statement is evaluated: LAG runs before DISTINCT, so the window function sees every underlying row, and DISTINCT cannot collapse rows whose pdate values differ.
You actually want to run the LAG after the DISTINCT, so the right query should be:
WITH t1 AS (
SELECT DISTINCT id, trunc(start_date) start_date
FROM example.table
WHERE ID = 1
)
SELECT t1.*, LAG(start_date, 1) OVER (ORDER BY start_date) pdate
FROM t1
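To see the effect in isolation, here is a small sketch that runs both query shapes through Python's built-in sqlite3 module rather than Oracle. It assumes SQLite 3.25 or newer for window functions, and date(...) stands in for TRUNC. The table name example_table and the four timestamps are made up for illustration, since the question does not show the underlying rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, start_date TEXT)")
# Hypothetical data: two timestamps on each of two days.
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?)",
    [
        (1, "2012-07-01 08:00:00"),
        (1, "2012-07-01 17:30:00"),
        (1, "2016-09-01 09:15:00"),
        (1, "2016-09-01 21:45:00"),
    ],
)

# LAG is computed for every underlying row; DISTINCT is applied afterwards,
# so rows that share a day but differ in pdate all survive.
lag_then_distinct = conn.execute(
    """
    SELECT DISTINCT id,
           date(start_date) AS start_day,
           LAG(start_date) OVER (ORDER BY start_date) AS pdate
    FROM example_table
    WHERE id = 1
    """
).fetchall()

# Deduplicate first, then compute LAG over the already-distinct days.
distinct_then_lag = conn.execute(
    """
    WITH t1 AS (
        SELECT DISTINCT id, date(start_date) AS start_day
        FROM example_table
        WHERE id = 1
    )
    SELECT id, start_day, LAG(start_day) OVER (ORDER BY start_day) AS pdate
    FROM t1
    ORDER BY start_day
    """
).fetchall()

print(len(lag_then_distinct), len(distinct_then_lag))
```

With this sample data the first query keeps one row per underlying timestamp, while the second returns one row per distinct day.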

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to combine each user's overlapping voucher periods into a single row (showing the minimum start date and maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that since rows 1 and 2 have overlapping dates, I would like to combine them and take min(start_date) and max(end_date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0 end) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date over the rows before the current one.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
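The same steps can be sketched outside the database to check the logic on a few rows. This is only an illustration in Python, not Redshift code: the function name merge_overlapping and the sample rows are hypothetical, and the voucher_code tie-breaker from the query is left out.

```python
from collections import defaultdict
from datetime import date


def merge_overlapping(rows):
    """Merge overlapping (id, start_date, end_date) periods per id."""
    by_id = defaultdict(list)
    for user_id, start, end in rows:
        by_id[user_id].append((start, end))

    merged = []
    for user_id in sorted(by_id):
        prev_max_end = None  # cumulative max of end_date over preceding rows
        for start, end in sorted(by_id[user_id]):
            if prev_max_end is None or prev_max_end < start:
                # No overlap with anything seen so far: a new group starts.
                merged.append([user_id, start, end])
            else:
                # Overlap: extend the current group's end date if needed.
                merged[-1][2] = max(merged[-1][2], end)
            prev_max_end = end if prev_max_end is None else max(prev_max_end, end)
    return [tuple(group) for group in merged]


# Hypothetical sample data.
sample = [
    (1, date(2020, 1, 1), date(2020, 1, 10)),
    (1, date(2020, 1, 5), date(2020, 1, 20)),
    (1, date(2020, 2, 1), date(2020, 2, 5)),
    (2, date(2020, 3, 1), date(2020, 3, 2)),
]
result = merge_overlapping(sample)
```

Each time a new group starts, the SQL version would add 1 to the running sum; here the same event appends a new entry to the result list.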
Another approach would be to use a recursive CTE:
Number the rows within each id, ordered by start_date and end_date.
Iterate over them, calculating group_start_date for each row (rows that have to be merged in the final result end up with the same group_start_date).
Finally, group the CTE by id and group_start_date, taking the max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS group_start_date,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date FROM Recursion group by id, group_start_date ORDER BY id, group_start_date

SQL Query - Combine rows based on multiple columns

On the image above, I'd like to combine rows with the same value on consecutive days.
Combined rows will have the earliest date on From column and the latest date on To column.
Looking at the example, even if Rows 3 and 4 have the same value, they were not combined because of the date gap.
I've tried using LAG and LEAD functions but no luck.
You can try the following approach:
with c as
(
select *, datediff(dd,todate,laedval) as leaddiff,
datediff(dd,todate,lagval) as lagdiff
from
(
select *,lead(todate) over(partition by value order by todate) laedval,
lag(todate) over(partition by value order by todate) lagval
from t1
)A
)
select * from
(
select value,min(todate) as fromdate,max(todate) as todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0) in (1,-1)
group by value
union all
select value,fromdate,todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0)>1 or coalesce(leaddiff,0)+coalesce(lagdiff,0)<-1
)A order by value
OUTPUT:
value fromdate todate
1 16/07/2019 00:00:00 17/07/2019 00:00:00
3 21/07/2019 00:00:00 26/07/2019 00:00:00
2 18/07/2019 00:00:00 18/07/2019 00:00:00
2 20/07/2019 00:00:00 20/07/2019 00:00:00
I am going to recommend the following approach:
Find where each new group begins. You can do this by comparing the previous maximum todate with the fromdate in this row.
Do a cumulative sum of the starts to define a group.
Aggregate the results.
This can be handled using window functions and aggregation:
select value, min(fromdate) as fromdate, max(todate) as todate
from (select t.*,
sum(case when prev_todate >= dateadd(day, -1, fromdate)
then 0 -- overlap, so this does not begin a new group
else 1 -- no overlap, so this does begin a new group
end) over
(partition by value order by fromdate) as grp
from (select t.*,
max(todate) over (partition by value
order by fromdate
rows between unbounded preceding and 1 preceding
) as prev_todate
from t
) t
) t
group by value, grp
order by value, min(fromdate);
Here is a db<>fiddle.

Cumulative distinct count

I'm having trouble getting a cumulative distinct count so let's just assume the below dataset.
DATE RID
1/1/18 1
1/1/18 2
1/1/18 3
1/1/18 3
So if we run this query
SELECT DATE, COUNT(DISTINCT RID) FROM TABLE GROUP BY DATE;
we would expect it to return 3. However, let's assume that the data for the next day is as follows.
DATE RID
1/2/18 1
1/2/18 6
1/2/18 9
How would you write a query that returns the following results, where the RIDs already seen on 1/1/18 are taken into account when counting the distinct RIDs for 1/2/18?
The expected results would be:
Date Count(*)
1/1/18 3
1/2/18 5 <- distinct RIDs from 1/1/18 plus the new distinct RIDs from 1/2/18
Hope that makes sense. Keep in mind this is a very large dataset, if that changes things.
You can do a cumulative count of the earliest date for each rid:
select mindate, count(*), sum(count(*)) over (order by mindate)
from (select rid, min(date) as mindate
from t
group by rid
) t
group by mindate
order by mindate;
Note: This will be missing any date that is not the mindate for at least one rid. Here is one way to get all the dates, if that is an issue:
select mindate, count(rid), sum(count(rid)) over (order by mindate)
from ((select rid, min(date) as mindate
from t
group by rid
)
union all
(select distinct NULL, date
from t
)
) rd
group by mindate
order by mindate;
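To make the idea concrete, here is a rough Python sketch of the same technique: find the first date each rid appears, count how many rids are new on each date, then take a running total. The function name cumulative_distinct is hypothetical, and the dates are written as ISO strings (an assumption, so that they sort correctly as text).

```python
from collections import Counter
from itertools import accumulate


def cumulative_distinct(rows):
    """rows: list of (day, rid). Returns [(day, distinct rids seen so far)]."""
    # Equivalent of: select rid, min(date) as mindate ... group by rid
    first_seen = {}
    for day, rid in rows:
        if rid not in first_seen or day < first_seen[rid]:
            first_seen[rid] = day

    # Number of rids whose first appearance falls on each day.
    new_per_day = Counter(first_seen.values())

    # Include every day, even one on which no rid is new.
    days = sorted({day for day, _ in rows})
    running = accumulate(new_per_day.get(day, 0) for day in days)
    return list(zip(days, running))


# The data from the question, with dates rewritten as ISO strings.
rows = [
    ("2018-01-01", 1), ("2018-01-01", 2), ("2018-01-01", 3), ("2018-01-01", 3),
    ("2018-01-02", 1), ("2018-01-02", 6), ("2018-01-02", 9),
]
result = cumulative_distinct(rows)
```

Because each rid is counted only on its first date, the running total never double-counts a rid that reappears later.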
The query below gives the required cumulative distinct count.
--Step 3:
SELECT dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT rid,
dt,
COUNT(CASE WHEN row_num = 1 THEN rid END) OVER (ORDER BY dt ROWS BETWEEN Unbounded PRECEDING AND CURRENT ROW) cum_distinct_cnt
FROM (
--Step 1:
SELECT rid,
dt,
ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
FROM table) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cum_distinct_cnt DESC) = 1
Since your dataset is very large, you can break the query above into the steps marked in its comments and create work tables to populate innerTab1 and innerTab2 before producing the final output.

Set date to last day of previous month in Oracle if it's not the last row

Whole setup on SQL Fiddle: http://sqlfiddle.com/#!4/1fd0e/5
I have some data containing a person's id, level, and the level's date range, as shown below:
PID LVL START_DATE END_DATE
1 1 01.01.14 19.03.14
1 2 20.03.14 15.08.14
1 3 16.08.14 09.10.14
1 4 10.10.14 31.12.14
2 1 01.01.14 31.12.14
3 1 01.01.14 16.01.14
I need to set the start date to the first day of its month. The end date should become the last day of the previous month, unless the row is the last one for that person, in which case it becomes the last day of its own month.
What I've done so far:
select
pid, lvl,
trunc(start_date, 'month') as start_date,
case when lead(pid, 1) over (PARTITION BY pid order by end_date) is not null
then last_day(add_months(end_date, -1))
else last_day(end_date)
end as end_date
from date_tbl t;
gives me the desired output:
PID LVL START_DATE END_DATE
1 1 01.01.14 28.02.14
1 2 01.03.14 31.07.14
1 3 01.08.14 30.09.14
1 4 01.10.14 31.12.14
2 1 01.01.14 31.12.14
3 1 01.01.14 31.01.14
BUT: it only works well with my test data. On my production data, a table containing 25k+ rows (which is not too much data, I'd say), it performs really slowly.
Can anyone give me a hint on how I could improve the query's performance? Which indexes to create on which columns, for example? The only indexed column so far is the PID column.
Actually, as I understand it, your script produces a wrong result if a person has only one record (the case with pid = 3).
Could you please try this one?
select
pid,
lvl,
trunc(start_date, 'month') as start_date,
last_day(add_months(end_date, case when lvl = max(lvl) over (partition by pid) then 0 else -1 end)) end_date
from date_tbl t;
I guess you need to build an index on the columns (pid, lvl desc).
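As a sanity check on the date arithmetic (not on performance), the rule in the query above can be sketched in Python and run against the sample rows from the question. The function names are hypothetical, and, like the query, this assumes the row with the highest lvl is the last row for each pid.

```python
import calendar
from datetime import date, timedelta


def first_of_month(d):
    return d.replace(day=1)


def last_of_month(d):
    # monthrange returns (weekday of first day, number of days in month).
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def last_of_previous_month(d):
    return d.replace(day=1) - timedelta(days=1)


def normalise_ranges(rows):
    """rows: list of (pid, lvl, start_date, end_date)."""
    # Equivalent of max(lvl) over (partition by pid).
    max_lvl = {}
    for pid, lvl, _, _ in rows:
        max_lvl[pid] = max(lvl, max_lvl.get(pid, lvl))

    result = []
    for pid, lvl, start, end in rows:
        if lvl == max_lvl[pid]:
            new_end = last_of_month(end)            # last row for this pid
        else:
            new_end = last_of_previous_month(end)   # any earlier row
        result.append((pid, lvl, first_of_month(start), new_end))
    return result


# Sample data from the question.
rows = [
    (1, 1, date(2014, 1, 1), date(2014, 3, 19)),
    (1, 2, date(2014, 3, 20), date(2014, 8, 15)),
    (1, 3, date(2014, 8, 16), date(2014, 10, 9)),
    (1, 4, date(2014, 10, 10), date(2014, 12, 31)),
    (2, 1, date(2014, 1, 1), date(2014, 12, 31)),
    (3, 1, date(2014, 1, 1), date(2014, 1, 16)),
]
result = normalise_ranges(rows)
```

On these rows the sketch reproduces the desired output listed in the question, including the single-row case for pid = 3.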
OK guys, sorry for wasting your time. To make it short: it was my fault. In my procedure, the query above is LEFT JOINed to another table in a subquery:
with dates as (
select
pid, lvl,
trunc(start_date, 'month') as start_date,
case when lead(pid, 1) over (PARTITION BY pid order by end_date) is not null
then last_day(add_months(end_date, -1))
else last_day(end_date)
end as end_date
from date_tbl t
),
some_other_table as (
select pid, (...some more columns)
from other_table
)
select * from (
select
b.pid, -- <== this has to be a.pid. b is much bigger than a!
a.start_date,
a.end_date
from dates a left join some_other_table b on a.pid = b.pid
)
The whole query is much bigger.
@jonearles
Thanks for your comment. "And what is the full query?" helped me get back on track: split the query into pieces and check again what REALLY slows it down.

SQL: Select Multiple Columns with Max() on calculated values

Real basic: I have table T with following data:
ID StartDate Term (months)
----------------------
1 10/1/2012 12
2 10/1/2012 24
3 12/1/2012 12
I need to know the ID of the row that has the max end date. I've successfully calculated the end date as
select max(DateAdd(month, term, StartDate)) from table [this would result in 10/1/2014]
How do I get the ID value and StartDate of the row that contains the max end date?
MS SQL:
SELECT TOP 1 ID, StartDate
FROM T
ORDER BY DateAdd(month, term, StartDate) DESC
MySQL:
SELECT ID, StartDate
FROM T
ORDER BY DATE_ADD(StartDate, INTERVAL term MONTH) DESC
LIMIT 1
In case more than one ID has the same extreme "end date" and you need them all, you can try this:
SELECT x.id, x.StartDate
FROM (
SELECT id, StartDate
, RANK ( ) OVER ( ORDER BY DateAdd(month, term, StartDate) DESC) as rn
FROM T
) x
WHERE x.rn = 1
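For comparison, the "keep every row tied for the latest end date" logic can be sketched in Python using the table from the question. The helper add_months is hypothetical and simplified: it assumes the day of the month also exists in the target month, which holds for the first-of-month dates used here.

```python
from datetime import date


def add_months(d, months):
    """Shift d by a number of months (assumes the day exists in the target month)."""
    total = d.year * 12 + (d.month - 1) + months
    return d.replace(year=total // 12, month=total % 12 + 1)


def rows_with_max_end_date(rows):
    """rows: list of (id, start_date, term_months). Returns all rows tied for the max end date."""
    end_dates = [add_months(start, term) for _, start, term in rows]
    latest = max(end_dates)
    return [
        (row_id, start)
        for (row_id, start, _), end in zip(rows, end_dates)
        if end == latest
    ]


# Table T from the question.
table_t = [
    (1, date(2012, 10, 1), 12),
    (2, date(2012, 10, 1), 24),
    (3, date(2012, 12, 1), 12),
]
result = rows_with_max_end_date(table_t)
```

Filtering on equality with the maximum plays the same role as RANK() = 1: every row sharing the latest end date is returned, not just one of them.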