What is the role of ORDER BY in the PARTITION BY function?

I have a table with the following data:
ID SEQ EFFDAT
------- --------- -----------------------
1024 1 01/07/2010 12:00:00 AM
1024 3 18/04/2017 12:00:00 AM
1024 2 01/08/2017 12:00:00 AM
When I execute the following query, I get the wrong maximum sequence, although I get the correct maximum effective date.
Query:
SELECT
max(seq) over (partition by id order by EFFDAT desc) maxEffSeq,
partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;
Output:
MAXEFFSEQ ID SEQ EFFDAT MAXEFFDAT
---------- ---------- ---------- ------------------------ ------------------------
2 1024 2 01/08/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 3 18/04/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 1 01/07/2010 12:00:00 AM 01/08/2017 12:00:00 AM
If I remove the order by from the max(seq) window in my query, I get the correct output.
Query:
SELECT max(seq) over (partition by id ) maxEffSeq, partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;
Output:
MAXEFFSEQ ID SEQ EFFDAT MAXEFFDAT
---------- ---------- ---------- ------------------------ ------------------------
3 1024 2 01/08/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 3 18/04/2017 12:00:00 AM 01/08/2017 12:00:00 AM
3 1024 1 01/07/2010 12:00:00 AM 01/08/2017 12:00:00 AM
I know that an order by clause is not required when using the MAX function. But I would like to know how order by works in a partition by clause, and why it gives the wrong result for the sequence but the correct result for the date when I use the order by clause.

Adding an order by also implies a windowing clause, and as you haven't specified one you get the default, so you're really doing:
max(seq) over (
partition by id
order by EFFDAT desc
range between unbounded preceding and current row
)
If you think about how the data looks if you order it in the same way, by descending date:
select partitionbytest.*,
count(*) over (partition by id order by effdat desc) range_rows,
max(seq) over (partition by id order by effdat desc) range_max_seq,
count(*) over (partition by id) id_rows,
max(seq) over (partition by id) id_max_seq
from partitionbytest
order by effdat desc;
ID SEQ EFFDAT RANGE_ROWS RANGE_MAX_SEQ ID_ROWS ID_MAX_SEQ
---------- ---------- ---------- ---------- ------------- ---------- ----------
1024 2 2017-08-01 1 2 3 3
1024 3 2017-04-18 2 3 3 3
1024 1 2010-07-01 3 3 3 3
then it becomes a bit clearer. I've included equivalent analytic counts so you can also see how many rows are being considered, with and without the order by clause.
For the first row the max seq value is found from looking at that current row's data and all preceding rows with later dates (as it's descending), and there are none of those, so it is the value from that row itself - so it's 2. The rows following it, with seq values 3 and 1, are not considered.
For the second row it looks at the current row and all preceding rows with later dates, so it can consider both the preceding value of 2 and the current value of 3. Since 3 is highest among those, it shows that. The row following it, with seq value 1, is not considered.
For the third row it looks at the current row and all preceding rows with later dates, so it can consider the preceding values of 2 and 3 and the current value of 1. Since 3 is still highest it shows that again.
Without the order by clause it always considers all values for that ID, so it sees 3 as the highest for all of them.
See the documentation for analytic functions for more details of how this is determined, particularly:
The group of rows is called a window and is defined by the analytic_clause. For each row, a sliding window of rows is defined. The window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.
and
You cannot specify [windowing_clause] unless you have specified the order_by_clause.
and
If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
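The effect of the default window can be reproduced outside Oracle. The following is a minimal sketch that assumes SQLite 3.25 or later (driven through Python's standard sqlite3 module) as a stand-in for the asker's database, with the dates stored as ISO text so that they sort chronologically. The column aliases default_frame, explicit_default and whole_partition are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the asker's Oracle table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE partitionByTest (id INTEGER, seq INTEGER, effdat TEXT)")
conn.executemany(
    "INSERT INTO partitionByTest VALUES (?, ?, ?)",
    [(1024, 1, "2010-07-01"), (1024, 3, "2017-04-18"), (1024, 2, "2017-08-01")],
)

rows = conn.execute(
    """
    SELECT seq,
           -- ORDER BY with no windowing clause: the default frame applies
           MAX(seq) OVER (PARTITION BY id ORDER BY effdat DESC) AS default_frame,
           -- the same frame written out explicitly
           MAX(seq) OVER (PARTITION BY id ORDER BY effdat DESC
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS explicit_default,
           -- a frame that covers the whole partition despite the ORDER BY
           MAX(seq) OVER (PARTITION BY id ORDER BY effdat DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS whole_partition
    FROM partitionByTest
    ORDER BY effdat DESC
    """
).fetchall()

for row in rows:
    print(row)
```

The first two computed columns should agree on every row, and only the third should show 3 on the first row, which matches the explanation above.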

This is correct, although it seems very strange.
When MAX is used as an analytic function, its order by clause may be followed by a windowing clause. By specifying an order by clause without a windowing clause, you pick up the default behaviour of the windowing clause.
The default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Docs : https://docs.oracle.com/database/121/SQLRF/functions004.htm#SQLRF06174
If you omit the windowing_clause entirely, then the default is RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Related

Most Efficient SQL to Calculate Running Streak Occurrences

I am looking for the most efficient manner to determine the longest occurrence of a streak within a given data set; specifically, to determine the longest winning streak of games.
Below is the SQL that I have thus far, and it does seem to perform very fast, and as expected from the limited testing I've done on a dataset with around 100,000 records.
DECLARE @HistoryDateTimeLimit datetime = '3/15/2018';
CTE to create result subset from voting dataset.
WITH Results AS (
SELECT
EntityPlayerId,
(CASE
WHEN VoteTeamA = 1 AND ParticipantAScore > ParticipantBScore THEN 'W'
WHEN VoteTeamA = 0 AND ParticipantBScore > ParticipantAScore THEN 'W'
ELSE 'L'
END) AS WinLoss,
match.ScheduledStartDateTime
FROM
[dbo].[MatchVote] vote
INNER JOIN [dbo].[MatchMetaData] match ON vote.MatchId = match.MatchId
WHERE
IsComplete = 1
AND ScheduledStartDateTime >= @HistoryDateTimeLimit
)
CTE to create subset of data with streak type as WinLoss and total count of votes in the partition using ROW_NUMBER().
WITH Streaks AS (
SELECT
EntityPlayerId,
ScheduledStartDateTime,
WinLoss,
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId ORDER BY ScheduledStartDateTime) -
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId, WinLoss ORDER BY ScheduledStartDateTime) AS Streak
FROM
Results
)
CTE to summarize the partitioned vote streaks by WinLoss and a begin date/time, with the total count in the streak.
WITH StreakCounts AS (
SELECT
EntityPlayerId,
WinLoss,
MIN(ScheduledStartDateTime) StreakStart,
MAX(ScheduledStartDateTime) StreakEnd,
COUNT(*) as Streak
FROM
Streaks
GROUP BY
EntityPlayerId, WinLoss, Streak
)
CTE to select the MAXIMUM (longest) vote streak for WinLoss of W (win) grouped by players.
WITH LongestWinStreak AS (
SELECT
EntityPlayerId,
MAX(Streak) AS LongestStreak
FROM
StreakCounts
WHERE
WinLoss = 'W'
GROUP BY
EntityPlayerId
)
Selecting the useful data from the LongestWinStreak CTE.
SELECT * FROM LongestWinStreak
This is the 3rd iteration of the code; at first I feel like I was overthinking and using windows with the LAG function to define a reset period that was later used for partitioning.
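For readers who want to try the row-number-difference technique end to end, here is a minimal sketch assuming SQLite 3.25 or later via Python's sqlite3 module rather than SQL Server. The results table, its columns (player_id, played_at, win_loss) and the sample rows are hypothetical and only mirror the shape of the Results CTE above. The CTEs are chained with commas under a single WITH, which is how they need to be combined when run as one statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table shaped like the output of the Results CTE.
conn.execute("CREATE TABLE results (player_id INTEGER, played_at TEXT, win_loss TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        (1, "2018-03-15", "W"), (1, "2018-03-16", "W"), (1, "2018-03-17", "L"),
        (1, "2018-03-18", "W"), (1, "2018-03-19", "W"), (1, "2018-03-20", "W"),
        (1, "2018-03-21", "L"),
        (2, "2018-03-15", "L"), (2, "2018-03-16", "W"),
    ],
)

rows = conn.execute(
    """
    WITH streaks AS (
        SELECT player_id, win_loss,
               -- the difference of the two row numbers is constant within one streak
               ROW_NUMBER() OVER (PARTITION BY player_id ORDER BY played_at)
             - ROW_NUMBER() OVER (PARTITION BY player_id, win_loss ORDER BY played_at) AS streak_grp
        FROM results
    ),
    streak_counts AS (
        SELECT player_id, win_loss, COUNT(*) AS streak_len
        FROM streaks
        GROUP BY player_id, win_loss, streak_grp
    )
    SELECT player_id, MAX(streak_len) AS longest_win_streak
    FROM streak_counts
    WHERE win_loss = 'W'
    GROUP BY player_id
    ORDER BY player_id
    """
).fetchall()

print(rows)
```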
[UPDATE]: SQLFiddle example at http://sqlfiddle.com/#!18/5b33a/1 -- Sample data for the two tables that are used above are as follows.
The data is meant to show the schema, and can be extrapolated for your own testing/usage;
MatchVote table data.
EntityPlayerId IsExtMatch MatchId VoteTeamA VoteDateTime IsComplete
-------------------- ------------ -------------------- --------- ----------------------- ----------
158 1 152639 0 2018-03-20 23:25:28.910 1
158 1 156058 1 2018-03-13 23:36:57.517 1
MatchMetaData table data.
MatchId IsTeamTournament MatchCompletedDateTime ScheduledStartDateTime MatchIsFinalized TournamentId TournamentTitle TournamentLogoUrl TournamentLogoThumbnailUrl GameName GameShortCode GameLogoUrl ParticipantAScore ParticipantAName ParticipantALogoUrl ParticipantBScore ParticipantBName ParticipantBLogoUrl
--------- ---------------- ----------------------- ----------------------- ---------------- -------------------- ------------------ ----------------------- ---------------------------- --------------------------------- -------------- ----------------------- ------------------ ------------------- --------------------- ----------------- ------------------- --------------------
23354 1 2014-07-30 00:30:00.000 2014-07-30 00:00:00.000 1 543 Sample https://...Small.png https://...Small.png Dota 2 Dota 2 https://...logo.png 3 Natus Vincere.US https://...VI.png 0 Not Today https://...ay.png
44324 1 2014-12-15 12:40:00.000 2014-12-15 11:40:00.000 1 786 Sample https://...Small.png https://...Small.png Counter-Strike: Global Offensive CS:GO https://...logo.png 0 Avalier's stars https://...oto.png 1 Kassad's Legends https://...oto.png

How to get rolling MIN number for all rest rows(include current rows) BY category

I have a data table as below, which is sorted by delivery date, route_number and sequence.
Delivery Date Order_ID Route_Number Stop # Sequence Min Stop# Formula
12/11/2017 Z11 100201 2 1 1 MIN(D2:$D$6)
12/11/2017 Z12 100201 1 2 1 MIN(D3:$D$6)
12/11/2017 Z13 100201 3 3 3 MIN(D4:$D$6)
12/11/2017 Z14 100201 5 4 4 MIN(D5:$D$6)
12/11/2017 Z15 100201 4 5 4 MIN(D6:$D$6)
What I am trying to do is get the column Min Stop# in my SQL query, the way I can in Excel.
The logic is: give me the min stop# from the current row through all remaining rows in the same route_number and delivery date. I am thinking of something like partition by delivery_date, route_number.
Does anyone have any ideas?
Thanks

Use min window function.
select t.*,min(stop) over(partition by route_number,delivery_date
order by sequence rows between current row
and unbounded following) as min_stop
from tbl t

min(stop) over (partition by route_number, delivery_date
order by sequence rows between current row and unbounded following)
or
min(stop) over (partition by route_number, delivery_date
order by sequence desc rows between unbounded preceding and current row)
which can be simplified to
min(stop) over (partition by route_number, delivery_date
order by sequence desc) m2
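because range between unbounded preceding and current row is the default window when you use ordering in the over clause, and with a unique sequence per partition it selects the same rows as rows between unbounded preceding and current row.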
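Both forms can be checked against the sample rows from the question. This is a minimal sketch assuming SQLite 3.25 or later through Python's sqlite3 module; the table name tbl and column name stop follow the answers above, while the remaining column names and the ISO date format are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (delivery_date TEXT, order_id TEXT, "
    "route_number INTEGER, stop INTEGER, sequence INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        ("2017-12-11", "Z11", 100201, 2, 1),
        ("2017-12-11", "Z12", 100201, 1, 2),
        ("2017-12-11", "Z13", 100201, 3, 3),
        ("2017-12-11", "Z14", 100201, 5, 4),
        ("2017-12-11", "Z15", 100201, 4, 5),
    ],
)

rows = conn.execute(
    """
    SELECT order_id, stop,
           -- look forward: current row through the end of the partition
           MIN(stop) OVER (PARTITION BY route_number, delivery_date
                           ORDER BY sequence
                           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS min_forward,
           -- reverse the order and rely on the default frame instead
           MIN(stop) OVER (PARTITION BY route_number, delivery_date
                           ORDER BY sequence DESC) AS min_reversed
    FROM tbl
    ORDER BY sequence
    """
).fetchall()

for row in rows:
    print(row)
```

Both computed columns should reproduce the Min Stop# column from the spreadsheet.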

Find nearest next date based on first row date

I have a table in postgresql db as follows:
sl_no | valid_from |
--------------------
1 02-04-2013
2 02-09-2012
3 02-11-2015
4 02-01-2011
5 02-10-2015
I want to get all rows ordered by valid_from, along with one dummy column named valid_to. The values of valid_to should come from the nearest next date after each valid_from value.
Something like below:
sl_no | valid_from | valid_to |
---------------------------------
4 02-01-2011 02-09-2012
2 02-09-2012 02-04-2013
1 02-04-2013 02-10-2015
5 02-10-2015 02-11-2015
3 02-11-2015 02-11-2015
Thanks..

The lead() will do that:
select sl_no, valid_from,
lead(valid_from, 1, valid_from) over (order by valid_from) as valid_to
from the_table
order by valid_from;
lead() picks the value of the specified column from the next row (as defined by the order by). The parameters 1, valid_from specify that the database should look 1 row ahead and, if there is no such row, return the third parameter instead. lead(valid_from) is a short form of lead(valid_from, 1, null).
See the manual for details:
http://www.postgresql.org/docs/current/static/tutorial-window.html
http://www.postgresql.org/docs/current/static/functions-window.html
SQLFiddle example: http://sqlfiddle.com/#!15/61d53/1
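The same query can be tried without a PostgreSQL server. This is a minimal sketch assuming SQLite 3.25 or later via Python's sqlite3 module, which also accepts the three-argument form of lead(). It further assumes the dates in the question are in DD-MM-YYYY format; they are stored here as ISO text so that they sort chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (sl_no INTEGER, valid_from TEXT)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?)",
    [
        (1, "2013-04-02"),
        (2, "2012-09-02"),
        (3, "2015-11-02"),
        (4, "2011-01-02"),
        (5, "2015-10-02"),
    ],
)

rows = conn.execute(
    """
    SELECT sl_no, valid_from,
           -- next valid_from in date order; the last row falls back to its own value
           LEAD(valid_from, 1, valid_from) OVER (ORDER BY valid_from) AS valid_to
    FROM the_table
    ORDER BY valid_from
    """
).fetchall()

for row in rows:
    print(row)
```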

How to add a running count to rows in a 'streak' of consecutive days

Thanks to Mike for the suggestion to add the create/insert statements.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');
I want to add a new column that is 'days in current streak'
so the result would look like:
pid | date | in_streak
-------|-----------|----------
1 | 2014-10-1 | 1
1 | 2014-10-2 | 2
1 | 2014-10-3 | 3
1 | 2014-10-5 | 1
1 | 2014-10-7 | 1
2 | 2014-10-2 | 1
2 | 2014-10-3 | 2
2 | 2014-10-4 | 3
2 | 2014-10-6 | 1
I've been trying to use the answers from
PostgreSQL: find number of consecutive days up until now
Return rows of the latest 'streak' of data
but I can't work out how to use the dense_rank() trick with other window functions to get the right result.

Building on this table (not using the SQL keyword "date" as column name.):
CREATE TABLE tbl(
pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);
Query:
SELECT pid, the_date
, row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM (
SELECT *
, the_date - '2000-01-01'::date
- row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM tbl
) sub
ORDER BY pid, the_date;
Subtracting a date from another date yields an integer. Since you are looking for consecutive days, every next row would be greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it's simple to deal out a number per row within each group.
grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:
the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
One multiplication, one subtraction. String concatenation and casting is more expensive. Test with EXPLAIN ANALYZE.
Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.
Using a subquery, since that is typically faster than a CTE. There is nothing here that a plain subquery couldn't do.
And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.
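As a rough way to experiment with this grouping trick without PostgreSQL, here is a minimal sketch assuming SQLite 3.25 or later through Python's sqlite3 module. SQLite has no date subtraction operator, so the sketch swaps in CAST(julianday(...) AS INTEGER) to get a day number; that substitution, and the zero-padded ISO dates it requires, are assumptions of this example and not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (pid INTEGER, the_date TEXT, PRIMARY KEY (pid, the_date))")
dates = ["2014-10-01", "2014-10-02", "2014-10-03", "2014-10-05", "2014-10-07"]
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(pid, d) for pid in (1, 2) for d in dates],
)

rows = conn.execute(
    """
    SELECT pid, the_date,
           ROW_NUMBER() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
    FROM (
        SELECT pid, the_date,
               -- day number minus row number stays constant across consecutive days
               CAST(julianday(the_date) AS INTEGER)
               - ROW_NUMBER() OVER (PARTITION BY pid ORDER BY the_date) AS grp
        FROM tbl
    ) AS sub
    ORDER BY pid, the_date
    """
).fetchall()

for row in rows:
    print(row)
```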

You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');
The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can group by the constant, and take the dense_rank() over that result.
with grouped_dates as (
select pid, date,
(date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
from test
)
select * , dense_rank() over (partition by pid, grouping_date order by date) as in_streak
from grouped_dates
order by pid, date
pid date grouping_date in_streak
--
1 2014-10-01 2014-09-30 1
1 2014-10-02 2014-09-30 2
1 2014-10-03 2014-09-30 3
1 2014-10-05 2014-10-01 1
1 2014-10-07 2014-10-02 1
2 2014-10-01 2014-09-30 1
2 2014-10-02 2014-09-30 2
2 2014-10-03 2014-09-30 3
2 2014-10-05 2014-10-01 1
2 2014-10-07 2014-10-02 1

How to partition the following table in DB2

I am trying to partition and order the following table, where I have used all sorts of row_number() over() and dense_rank() over() combinations but am not getting what I need.
The MWE table is as follows:
Person Visit Last_Visit Gap_1_yr
------ ----- -------- --------
1 01/01/2001 01/01/2000 NULL
1 01/01/2003 01/01/2001 gap
1 01/01/2004 01/01/2003 NULL
1 01/01/2006 01/01/2004 gap
2 01/01/2005 01/01/2002 gap
2 01/01/2010 01/01/2005 gap
Each row is a person turning up for an appointment, and Gap_1_yr is flagged when that appointment is more than 365 days after their previous appointment (I used a lag function for this).
What I want is, whenever there is a gap, I want to partition so I have the following:
Person Visit Last_Visit Gap_1_yr SEQ
------ ----- -------- -------- ---
1 01/01/2001 01/01/2000 NULL 1
1 01/01/2003 01/01/2001 gap 2
1 01/01/2004 01/01/2003 NULL 2
1 01/01/2006 01/01/2004 gap 3
2 01/01/2005 01/01/2002 gap 1
2 01/01/2010 01/01/2005 gap 2
You see that when there is a gap, the sequence iterates by one until the next gap - all per person.
I have tried:
row_number() over(partition by person order by gap)
but this iterates for every cell in SEQ until finding a new person -ignores gaps
and have tried:
dense_rank() over(partition by person order by gap)
returns 1's in every cell in SEQ
dense_rank() over(partition by person,gap order by gap)
also returns all 1's.
does anyone have any suggestions?

Convert the gap to a flag. Then use sum() to do a cumulative sum of the flag:
select mwe.*,
sum(case when gap_1_yr = 'gap' then 1 else 0 end) over
(partition by person order by visit) as seq
from mwe;
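The running sum can be tried on the sample rows with the following minimal sketch, which assumes SQLite 3.25 or later via Python's sqlite3 module instead of DB2 and stores the visit dates as ISO text. Note that with this data the running sum starts at 0 for person 1, whereas the desired output in the question starts at 1; whether an offset is needed depends on how a person's first visit should be treated, which the question leaves open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mwe (person INTEGER, visit TEXT, gap_1_yr TEXT)")
conn.executemany(
    "INSERT INTO mwe VALUES (?, ?, ?)",
    [
        (1, "2001-01-01", None),
        (1, "2003-01-01", "gap"),
        (1, "2004-01-01", None),
        (1, "2006-01-01", "gap"),
        (2, "2005-01-01", "gap"),
        (2, "2010-01-01", "gap"),
    ],
)

rows = conn.execute(
    """
    SELECT person, visit, gap_1_yr,
           -- each 'gap' row adds 1 to the running total for that person
           SUM(CASE WHEN gap_1_yr = 'gap' THEN 1 ELSE 0 END)
               OVER (PARTITION BY person ORDER BY visit) AS seq
    FROM mwe
    ORDER BY person, visit
    """
).fetchall()

for row in rows:
    print(row)
```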
