Find nearest next date based on first row date - sql

I have a table in a PostgreSQL db as follows:
sl_no | valid_from
------+-----------
    1 | 02-04-2013
    2 | 02-09-2012
    3 | 02-11-2015
    4 | 02-01-2011
    5 | 02-10-2015
I want to get all rows ordered by valid_from, along with a computed column named valid_to. The value of valid_to should be the nearest next date after each row's valid_from.
Something like below:
sl_no | valid_from | valid_to
------+------------+-----------
    4 | 02-01-2011 | 02-09-2012
    2 | 02-09-2012 | 02-04-2013
    1 | 02-04-2013 | 02-10-2015
    5 | 02-10-2015 | 02-11-2015
    3 | 02-11-2015 | 02-11-2015
Thanks..

The lead() window function will do that:
select sl_no, valid_from,
lead(valid_from, 1, valid_from) over (order by valid_from) as valid_to
from the_table
order by valid_from;
lead() picks the value of the specified column from the next row (as defined by the order by). The parameters 1, valid_from specify that the database should look one row ahead and, if there is no such row, return the third parameter instead. lead(valid_from) is a short form of lead(valid_from, 1, null).
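For example, to see the difference between the default and the three-argument form on the sample data (a sketch, using the same placeholder table name the_table):
select sl_no, valid_from,
       lead(valid_from) over (order by valid_from) as next_or_null,
       lead(valid_from, 1, valid_from) over (order by valid_from) as next_or_self
from the_table
order by valid_from;
-- For the last row (sl_no 3), next_or_null is NULL, while next_or_self
-- repeats that row's own valid_from, matching the expected output above.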
See the manual for details:
http://www.postgresql.org/docs/current/static/tutorial-window.html
http://www.postgresql.org/docs/current/static/functions-window.html
SQLFiddle example: http://sqlfiddle.com/#!15/61d53/1

Related

Aggregate in plsql

ORGANIZATION_ID | BAY_ID | CASCADE_GROUP_ID | DOWNSTEAM_VALUE
----------------|--------|------------------|----------------
1001            | 100012 | 1                | 2
1001            | 100014 | 1                | 4
1001            | 100016 | 1                | 6
1001            | 100018 | 1                | 8
I need to create a view that aggregates the values of the DOWNSTEAM_VALUE column of the table above, as shown in the expected output below. The aggregation is driven by BAY_ID: each row's downstream value is its own DOWNSTEAM_VALUE plus the DOWNSTEAM_VALUE of every following BAY_ID, in ascending BAY_ID order. For the first row, with BAY_ID 100012, that is 2+4+6+8 = 20; for the next BAY_ID it is 4+6+8 = 18; and since the last BAY_ID has no further values to add, it stays 8.
ORGANIZATION_ID | BAY_ID | CASCADE_GROUP_ID | DOWNSTEAM_VALUE
----------------|--------|------------------|----------------
1001            | 100012 | 1                | 20
1001            | 100014 | 1                | 18
1001            | 100016 | 1                | 14
1001            | 100018 | 1                | 8
Any help would be really appreciated. Thanks
You can use the SUM analytic function with a windowing clause for that, like below:
select ORGANIZATION_ID
     , BAY_ID
     , CASCADE_GROUP_ID
     , sum(DOWNSTEAM_VALUE) over (
           partition by ORGANIZATION_ID, CASCADE_GROUP_ID
           order by BAY_ID asc
           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
       ) as DOWNSTEAM_VALUE
from your_table;
The window ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING makes each row's sum cover the current row and every following BAY_ID in its partition, which gives 20, 18, 14 and 8 for the sample data.

adjust date overlaps within a group

I have this table and I want to adjust END_DATE to one day before the next ST_DATE whenever the dates overlap within a group of IDs.
TABLE HAVE
ID ST_DATE END_DATE
1 2020-01-01 2020-02-01
1 2020-05-10 2020-05-20
1 2020-05-18 2020-06-19
1 2020-11-11 2020-12-01
2 1999-03-09 1999-05-10
2 1999-04-09 2000-05-10
3 1999-04-09 2000-05-10
3 2000-06-09 2000-08-16
3 2000-08-17 2009-02-17
Below is what I'm looking for
TABLE WANT
ID ST_DATE END_DATE
1 2020-01-01 2020-02-01
1 2020-05-10 2020-05-17 =====changed to a day less than the next ST_DATE due to some sort of overlap
1 2020-05-18 2020-06-19
1 2020-11-11 2020-12-01
2 1999-03-09 1999-04-08 =====changed to a day less than the next ST_DATE due to some sort of overlap
2 1999-04-09 2000-05-10
3 1999-04-09 2000-05-10
3 2000-06-09 2000-08-16
3 2000-08-17 2009-02-17
Maybe you can use LEAD() for this. Initial idea:
select
id, st_date, end_date
, lead( st_date ) over ( partition by id order by st_date ) nextstart_
from overlap
;
-- result
ID ST_DATE END_DATE NEXTSTART
---------- --------- --------- ---------
1 01-JAN-20 01-FEB-20 10-MAY-20
1 10-MAY-20 20-MAY-20 18-MAY-20
1 18-MAY-20 19-JUN-20 11-NOV-20
1 11-NOV-20 01-DEC-20
2 09-MAR-99 10-MAY-99 09-APR-99
2 09-APR-99 10-MAY-00
3 09-APR-99 10-MAY-00 09-JUN-00
3 09-JUN-00 16-AUG-00 17-AUG-00
3 17-AUG-00 17-FEB-09
Once you have the next start date and the end_date side by side (as it were),
you can use CASE ... for adjusting the dates as you need them.
select ilv.id, ilv.st_date
, case
when ilv.end_date > ilv.nextstart_ then
to_char( ilv.nextstart_ - 1 ) || ' <- modified end date'
else
to_char( ilv.end_date )
end dt_modified
from (
select
id, st_date, end_date
, lead( st_date ) over ( partition by id order by st_date ) nextstart_
from overlap
) ilv
;
ID ST_DATE DT_MODIFIED
---------- --------- ---------------------------------------
1 01-JAN-20 01-FEB-20
1 10-MAY-20 17-MAY-20 <- modified end date
1 18-MAY-20 19-JUN-20
1 11-NOV-20 01-DEC-20
2 09-MAR-99 08-APR-99 <- modified end date
2 09-APR-99 10-MAY-00
3 09-APR-99 10-MAY-00
3 09-JUN-00 16-AUG-00
3 17-AUG-00 17-FEB-09
If two "windows" for the same id have the same start date, then the problem doesn't make sense. So, let's assume that the problem makes sense - that is, the combination (id, st_date) is unique in the inputs.
Then the problem can be formulated as follows: for each id, order the rows by st_date ascending. For each row, if its end_date is less than the following st_date, return the row as is; otherwise replace end_date with the following st_date minus 1. This last step can be achieved with the analytic lead() function.
A solution might look like this:
select id, st_date,
least(end_date, lead(st_date, 1, end_date + 1)
over (partition by id order by st_date) - 1) as end_date
from have
;
The bit about end_date + 1 in the lead function handles the last row for each id. For such rows there is no "next" row, so the default application of lead will return null. The default can be overridden by using the third parameter to the function.
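Since the goal is to adjust the stored END_DATE rather than just report it, the same expression can drive a MERGE back into the table. A sketch only, assuming the table is named have as above and that (id, st_date) is unique:
merge into have t
using (
    select id, st_date,
           least( end_date
                , lead(st_date, 1, end_date + 1)
                    over (partition by id order by st_date) - 1 ) as new_end
    from have
) s
on (t.id = s.id and t.st_date = s.st_date)
when matched then
    update set t.end_date = s.new_end;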

Query to find value in column dependent on a different column in table being the minimum date

I have a dataset that looks like this. I would like to pull a distinct id, the minimum date, and the value on the minimum date.
id date value
1 01/01/2020 0.5
1 02/01/2020 1
1 03/01/2020 2
2 01/01/2020 3
2 02/01/2020 4
2 03/01/2020 5
This code will pull the id and the minimum date:
select id, min(date)
from table
group by id
How can I get the value on the minimum date so the output of my query looks like this?
id date value
1 01/01/2020 0.5
2 01/01/2020 3
Use distinct on:
select distinct on (id) t.*
from t
order by id, date;
This can take advantage of an index on (id, date) and is typically the fastest way to do this operation in Postgres.
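For example, a minimal sketch of that supporting index (assuming the table is named t, as in the answer; Postgres generates an index name when none is given):
create index on t (id, date);
With it in place, Postgres can read each id's first row in index order instead of sorting the whole table.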

What is the role of ORDER BY in the PARTITION BY function?

I have a table with the following data:
ID SEQ EFFDAT
------- --------- -----------------------
1024 1 01/07/2010 12:00:00 AM
1024 3 18/04/2017 12:00:00 AM
1024 2 01/08/2017 12:00:00 AM
When I execute the following query, I get the wrong maximum sequence, yet the correct maximum effective date.
Query:
SELECT
max(seq) over (partition by id order by EFFDAT desc) maxEffSeq,
partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;
Output:
MAXEFFSEQ ID SEQ EFFDAT MAXEFFDAT
---------- ---------- ---------- ------------------------ ------------------------
2 1024 2 01/08/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 3 18/04/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 1 01/07/2010 12:00:00 AM 01/08/2017 12:00:00 AM
If I remove the order by from my query, I get the correct output.
Query:
SELECT max(seq) over (partition by id ) maxEffSeq, partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;
Output:
MAXEFFSEQ ID SEQ EFFDAT MAXEFFDAT
---------- ---------- ---------- ------------------------ ------------------------
3 1024 2 01/08/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 3 18/04/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 1 01/07/2010 12:00:00 AM 01/08/2017 12:00:00 AM
I know that when using the MAX function it is not required to use an order by clause. But I am interested to know how the order by works inside the analytic clause, and why it gives the wrong result for the sequence but the correct result for the date when I use the order by clause?
Adding an order by also implies a windowing clause, and as you haven't specified one you get the default, so you're really doing:
max(seq) over (
partition by id
order by EFFDAT desc
range between unbounded preceding and current row
)
If you look at the data ordered the same way, by descending date:
select partitionbytest.*,
count(*) over (partition by id order by effdat desc) range_rows,
max(seq) over (partition by id order by effdat desc) range_max_seq,
count(*) over (partition by id) id_rows,
max(seq) over (partition by id) id_max_seq
from partitionbytest
order by effdat desc;
ID SEQ EFFDAT RANGE_ROWS RANGE_MAX_SEQ ID_ROWS ID_MAX_SEQ
---------- ---------- ---------- ---------- ------------- ---------- ----------
1024 2 2017-08-01 1 2 3 3
1024 3 2017-04-18 2 3 3 3
1024 1 2010-07-01 3 3 3 3
then it becomes a bit clearer. I've included equivalent analytic counts so you can also see how many rows are being considered, with and without the order by clause.
For the first row, the max seq value is found by looking at the current row and all preceding rows with later dates (as the order is descending); there are none of those, so it is the value from the row itself: 2. The rows following it, with seq values 3 and 1, are not considered.
For the second row it looks at the current row and all preceding rows with later dates, so it can consider both the preceding value of 2 and the current value of 3. Since 3 is the highest among those, it shows that. The row following it, with seq value 1, is not considered.
For the third row it looks at the current row and all preceding rows with later dates, so it can consider the preceding values of 2 and 3 and the current value of 1. Since 3 is still the highest, it shows that again.
Without the order by clause it always considers all values for that ID, so it sees 3 as the highest for all of them.
See the documentation for analytic functions for more details of how this is determined, particularly:
The group of rows is called a window and is defined by the analytic_clause. For each row, a sliding window of rows is defined. The window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.
and
You cannot specify [windowing_clause] unless you have specified the order_by_clause.
and
If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
This is correct, although it seems very strange.
MAX used as an analytic function permits an order by clause, and the order by in turn allows a windowing clause - so by specifying an order by clause without a windowing clause, you pick up the default windowing behaviour (since you did not specify it).
The default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Docs : https://docs.oracle.com/database/121/SQLRF/functions004.htm#SQLRF06174
If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
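If you want to keep the order by but still have the whole partition considered, you can spell the window out explicitly instead of relying on that default. A sketch against the same partitionByTest table:
SELECT
    max(seq) over (
        partition by id
        order by EFFDAT desc
        rows between unbounded preceding and unbounded following
    ) maxEffSeq,
    partitionByTest.*
FROM partitionByTest;
This returns 3 for maxEffSeq on every row, matching the version without the order by.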

How to add a running count to rows in a 'streak' of consecutive days

Thanks to Mike for the suggestion to add the create/insert statements.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');
I want to add a new column that is 'days in current streak'
so the result would look like:
pid | date       | in_streak
----|------------|----------
1   | 2014-10-01 | 1
1   | 2014-10-02 | 2
1   | 2014-10-03 | 3
1   | 2014-10-05 | 1
1   | 2014-10-07 | 1
2   | 2014-10-01 | 1
2   | 2014-10-02 | 2
2   | 2014-10-03 | 3
2   | 2014-10-05 | 1
2   | 2014-10-07 | 1
I've been trying to use the answers from
PostgreSQL: find number of consecutive days up until now
Return rows of the latest 'streak' of data
but I can't work out how to use the dense_rank() trick with other window functions to get the right result.
Building on this table (not using the SQL keyword "date" as a column name):
CREATE TABLE tbl(
pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);
Query:
SELECT pid, the_date
, row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM (
SELECT *
, the_date - '2000-01-01'::date
- row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM tbl
) sub
ORDER BY pid, the_date;
Subtracting a date from another date yields an integer. Since you are looking for consecutive days, every next row would be greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it's simple to deal out numbers per group.
grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:
the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
One multiplication, one subtraction. String concatenation and casting is more expensive. Test with EXPLAIN ANALYZE.
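For example, a quick sketch of such a test for the interval variant (same table tbl as above; run each variant and compare the actual times reported):
explain analyze
select *
     , the_date - row_number() over (partition by pid order by the_date) * interval '1d' as grp
from tbl;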
Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.
Using a subquery, since that is typically faster than a CTE. There is nothing here that a plain subquery couldn't do.
And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.
You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.
The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can partition by pid and that constant, then take dense_rank() over the result.
with grouped_dates as (
    select pid, date,
           (date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
    from test
)
select *
     , dense_rank() over (partition by pid, grouping_date order by date) as in_streak  -- pid included so streaks of different pids never mix
from grouped_dates
order by pid, date;
pid date grouping_date in_streak
--
1 2014-10-01 2014-09-30 1
1 2014-10-02 2014-09-30 2
1 2014-10-03 2014-09-30 3
1 2014-10-05 2014-10-01 1
1 2014-10-07 2014-10-02 1
2 2014-10-01 2014-09-30 1
2 2014-10-02 2014-09-30 2
2 2014-10-03 2014-09-30 3
2 2014-10-05 2014-10-01 1
2 2014-10-07 2014-10-02 1